* [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
@ 2012-07-25 8:29 Wang Sen
2012-07-25 8:44 ` Paolo Bonzini
2012-07-25 10:04 ` Rolf Eike Beer
0 siblings, 2 replies; 34+ messages in thread
From: Wang Sen @ 2012-07-25 8:29 UTC (permalink / raw)
To: linux-scsi; +Cc: JBottomley, pbonzini, stefanha, mc, linux-kernel, Wang Sen
When using the commands below to write some data to a virtio-scsi LUN of the
QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
# sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
# sudo mount /dev/sdb /mnt
# dd if=/dev/zero of=/mnt/file bs=1M count=1024
In current implementation, sg_set_buf is called to add buffers to sg list which
is put into the virtqueue eventually. But there are some HighMem pages in
table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
in QEMU because an invalid GPA is passed by virtqueue.
My solution is using sg_set_page instead of sg_set_buf.
I have tested the patch on my workstation. QEMU would not crash any more.
Signed-off-by: Wang Sen <senwang@linux.vnet.ibm.com>
---
drivers/scsi/virtio_scsi.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 1b38431..fc5c88a 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -198,7 +198,8 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
int i;
for_each_sg(table->sgl, sg_elem, table->nents, i)
- sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
+ sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
+ sg_elem->offset);
*p_idx = idx;
}
--
1.7.9.5
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 8:29 [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Wang Sen
@ 2012-07-25 8:44 ` Paolo Bonzini
2012-07-25 9:22 ` Boaz Harrosh
` (2 more replies)
2012-07-25 10:04 ` Rolf Eike Beer
1 sibling, 3 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 8:44 UTC (permalink / raw)
To: Wang Sen; +Cc: linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 25/07/2012 10:29, Wang Sen wrote:
> When using the commands below to write some data to a virtio-scsi LUN of the
> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
>
> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
> # sudo mount /dev/sdb /mnt
> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>
> In current implementation, sg_set_buf is called to add buffers to sg list which
> is put into the virtqueue eventually. But there are some HighMem pages in
> table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
> return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
> in QEMU because an invalid GPA is passed by virtqueue.
Heh, I was compiling (almost) the same patch as we speak. :)
I've never seen QEMU crash; the VM would more likely just fail to boot
with a panic. But it's the same bug anyway.
> My solution is using sg_set_page instead of sg_set_buf.
>
> I have tested the patch on my workstation. QEMU would not crash any more.
>
> Signed-off-by: Wang Sen <senwang@linux.vnet.ibm.com>
> ---
> drivers/scsi/virtio_scsi.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index 1b38431..fc5c88a 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -198,7 +198,8 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
> int i;
>
> for_each_sg(table->sgl, sg_elem, table->nents, i)
> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
> + sg_elem->offset);
This can simply be
sg[idx++] = *sg_elem;
Can you repost it with this change, and also add stable@vger.kernel.org
to the Cc? Thanks very much!
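I.e. the whole copy loop would become just

	for_each_sg(table->sgl, sg_elem, table->nents, i)
		sg[idx++] = *sg_elem;	/* plain struct copy of the whole entry */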
Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 8:44 ` Paolo Bonzini
@ 2012-07-25 9:22 ` Boaz Harrosh
2012-07-25 9:41 ` Paolo Bonzini
2012-07-25 10:41 ` [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Stefan Hajnoczi
2012-07-25 11:44 ` Sen Wang
2 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 9:22 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 07/25/2012 11:44 AM, Paolo Bonzini wrote:
> On 25/07/2012 10:29, Wang Sen wrote:
>> When using the commands below to write some data to a virtio-scsi LUN of the
>> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
>>
>> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
>> # sudo mount /dev/sdb /mnt
>> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>>
>> In current implementation, sg_set_buf is called to add buffers to sg list which
>> is put into the virtqueue eventually. But there are some HighMem pages in
>> table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
>> return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
>> in QEMU because an invalid GPA is passed by virtqueue.
>
> Heh, I was compiling (almost) the same patch as we speak. :)
>
> I've never seen QEMU crash; the VM would more likely just fail to boot
> with a panic. But it's the same bug anyway.
>
>> My solution is using sg_set_page instead of sg_set_buf.
>>
>> I have tested the patch on my workstation. QEMU would not crash any more.
>>
>> Signed-off-by: Wang Sen <senwang@linux.vnet.ibm.com>
>> ---
>> drivers/scsi/virtio_scsi.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>> index 1b38431..fc5c88a 100644
>> --- a/drivers/scsi/virtio_scsi.c
>> +++ b/drivers/scsi/virtio_scsi.c
>> @@ -198,7 +198,8 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
>> int i;
>>
>> for_each_sg(table->sgl, sg_elem, table->nents, i)
>> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>> + sg_elem->offset);
>
> This can simply be
>
> sg[idx++] = *sg_elem;
>
> Can you repost it with this change, and also add stable@vger.kernel.org
> to the Cc? Thanks very much!
>
No! Please use sg_set_page()! Look at sg_set_page(), which calls sg_assign_page().
It has all the logic for jumping over chained arrays. When you start using long
sg_lists (which you should), jumping from chain to chain must go through
sg_page(sg_elem) and sg_assign_page(), as in the original patch.
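For reference, sg_set_page() is essentially this (trimmed from
include/linux/scatterlist.h, debug checks dropped):

	static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
	{
		unsigned long page_link = sg->page_link & 0x3;

		/*
		 * Keep the low chain/end marker bits; replace only the page.
		 * Pages must be 4-byte aligned for the bit stealing to work.
		 */
		BUG_ON((unsigned long) page & 0x03);
		sg->page_link = page_link | (unsigned long) page;
	}

	static inline void sg_set_page(struct scatterlist *sg, struct page *page,
				       unsigned int len, unsigned int offset)
	{
		sg_assign_page(sg, page);
		sg->offset = offset;
		sg->length = len;
	}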
Thanks
Boaz
> Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 9:22 ` Boaz Harrosh
@ 2012-07-25 9:41 ` Paolo Bonzini
2012-07-25 12:34 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 9:41 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 25/07/2012 11:22, Boaz Harrosh wrote:
>>> >> for_each_sg(table->sgl, sg_elem, table->nents, i)
>>> >> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>>> >> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>>> >> + sg_elem->offset);
>> >
>> > This can simply be
>> >
>> > sg[idx++] = *sg_elem;
>> >
>> > Can you repost it with this change, and also add stable@vger.kernel.org
>> > to the Cc? Thanks very much!
>> >
>
> No! Please use sg_set_page()! Look at sg_set_page(), which calls sg_assign_page().
> It has all the logic for jumping over chained arrays. When you start using long
> sg_lists (which you should), jumping from chain to chain must go through
> sg_page(sg_elem) and sg_assign_page(), as in the original patch.
Hi Boaz,
actually it seems to me that using sg_set_page is wrong, because it will
not copy the end marker from table->sgl to sg[]. If something chained
the sg[] scatterlist onto something else, sg_next's test for sg_is_last
would go beyond the table->nents-th item and access invalid memory.
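(For reference, sg_next() stops at the entry whose termination bit is set;
trimmed from lib/scatterlist.c it is roughly:

	struct scatterlist *sg_next(struct scatterlist *sg)
	{
		if (sg_is_last(sg))		/* page_link & 0x02 */
			return NULL;

		sg++;
		if (unlikely(sg_is_chain(sg)))	/* page_link & 0x01 */
			sg = sg_chain_ptr(sg);

		return sg;
	}

so if the end marker is not copied, nothing stops the walk at the
table->nents-th entry.)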
Using chained sglists is on my to-do list, I expect that it would make a
nice performance improvement. However, I was a bit confused as to
what's the plan there; there is hardly any user, and many arches still
do not define ARCH_HAS_SG_CHAIN. Do you have any pointer to discussions
or LWN articles?
I would need to add support for the long sglists to virtio; this is not
a problem, but in the past Rusty complained that long sg-lists are not
well suited to virtio (which would like to add elements not just at the
beginning of a given sglist, but also at the end). It seems to me that
virtio would prefer to work with a struct scatterlist ** rather than a
long sglist.
Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 8:29 [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Wang Sen
2012-07-25 8:44 ` Paolo Bonzini
@ 2012-07-25 10:04 ` Rolf Eike Beer
2012-07-25 11:46 ` Sen Wang
1 sibling, 1 reply; 34+ messages in thread
From: Rolf Eike Beer @ 2012-07-25 10:04 UTC (permalink / raw)
To: Wang Sen; +Cc: linux-scsi, jbottomley, pbonzini, stefanha, mc, linux-kernel
On 25.07.2012 10:29, Wang Sen wrote:
> When using the commands below to write some data to a virtio-scsi LUN
> of the
> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu
> will crash.
>
> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
> # sudo mount /dev/sdb /mnt
> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>
> In current implementation, sg_set_buf is called to add buffers to sg
> list which
> is put into the virtqueue eventually.
The next sentence is somehow broken:
> But there are some HighMem pages in
> table->sgl can not get virtual address by sg_virt.
Maybe something like "But _if_ there are ... _you_ can not get ..."?
Eike
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 8:44 ` Paolo Bonzini
2012-07-25 9:22 ` Boaz Harrosh
@ 2012-07-25 10:41 ` Stefan Hajnoczi
2012-07-25 11:48 ` Sen Wang
2012-07-25 11:44 ` Sen Wang
2 siblings, 1 reply; 34+ messages in thread
From: Stefan Hajnoczi @ 2012-07-25 10:41 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Wang Sen, linux-scsi, JBottomley, mc, linux-kernel
On Wed, Jul 25, 2012 at 10:44:14AM +0200, Paolo Bonzini wrote:
> On 25/07/2012 10:29, Wang Sen wrote:
> > When using the commands below to write some data to a virtio-scsi LUN of the
> > QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
> >
> > # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
> > # sudo mount /dev/sdb /mnt
> > # dd if=/dev/zero of=/mnt/file bs=1M count=1024
> >
> > In current implementation, sg_set_buf is called to add buffers to sg list which
> > is put into the virtqueue eventually. But there are some HighMem pages in
> > table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
> > return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
> > in QEMU because an invalid GPA is passed by virtqueue.
>
> Heh, I was compiling (almost) the same patch as we speak. :)
>
> I've never seen QEMU crash; the VM would more likely just fail to boot
> with a panic. But it's the same bug anyway.
It's not a segfault "crash", I think it hits an abort(3) in QEMU's
virtio code when trying to map an invalid guest physical address.
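(sg_virt() is essentially

	static inline void *sg_virt(struct scatterlist *sg)
	{
		return page_address(sg_page(sg)) + sg->offset;
	}

and page_address() has nothing sensible to return for a HighMem page without a
kernel mapping, so the "virtual address", and the guest physical address derived
from it, is garbage.)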
Stefan
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 8:44 ` Paolo Bonzini
2012-07-25 9:22 ` Boaz Harrosh
2012-07-25 10:41 ` [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Stefan Hajnoczi
@ 2012-07-25 11:44 ` Sen Wang
2012-07-25 12:40 ` Boaz Harrosh
2 siblings, 1 reply; 34+ messages in thread
From: Sen Wang @ 2012-07-25 11:44 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
2012/7/25 Paolo Bonzini <pbonzini@redhat.com>:
> On 25/07/2012 10:29, Wang Sen wrote:
>> When using the commands below to write some data to a virtio-scsi LUN of the
>> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
>>
>> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
>> # sudo mount /dev/sdb /mnt
>> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>>
>> In current implementation, sg_set_buf is called to add buffers to sg list which
>> is put into the virtqueue eventually. But there are some HighMem pages in
>> table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
>> return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
>> in QEMU because an invalid GPA is passed by virtqueue.
>
> Heh, I was compiling (almost) the same patch as we speak. :)
Uh, what a coincidence! :)
>
> I've never seen QEMU crash; the VM would more likely just fail to boot
> with a panic. But it's the same bug anyway.
I have never seen this before. How does this situation happen?
>
>> My solution is using sg_set_page instead of sg_set_buf.
>>
>> I have tested the patch on my workstation. QEMU would not crash any more.
>>
>> Signed-off-by: Wang Sen <senwang@linux.vnet.ibm.com>
>> ---
>> drivers/scsi/virtio_scsi.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>> index 1b38431..fc5c88a 100644
>> --- a/drivers/scsi/virtio_scsi.c
>> +++ b/drivers/scsi/virtio_scsi.c
>> @@ -198,7 +198,8 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
>> int i;
>>
>> for_each_sg(table->sgl, sg_elem, table->nents, i)
>> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>> + sg_elem->offset);
>
> This can simply be
>
> sg[idx++] = *sg_elem;
>
Yes, I saw your other e-mail. I think you're right. Simply calling
sg_set_page cannot handle the flag bits correctly. So, I'll repost the
patch soon. Thank you!
> Can you repost it with this change, and also add stable@vger.kernel.org
> to the Cc? Thanks very much!
>
> Paolo
--
------------------------------------------
Wang Sen
Addr: XUPT,Xi'an,Shaanxi,China
Email: kelvin.xupt@gmail.com
------------------------------------------
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 10:04 ` Rolf Eike Beer
@ 2012-07-25 11:46 ` Sen Wang
0 siblings, 0 replies; 34+ messages in thread
From: Sen Wang @ 2012-07-25 11:46 UTC (permalink / raw)
To: Rolf Eike Beer
Cc: Wang Sen, linux-scsi, jbottomley, pbonzini, stefanha, mc,
linux-kernel
2012/7/25 Rolf Eike Beer <eike-kernel@sf-tec.de>:
> On 25.07.2012 10:29, Wang Sen wrote:
>
>> When using the commands below to write some data to a virtio-scsi LUN of
>> the
>> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu
>> will crash.
>>
>> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
>> # sudo mount /dev/sdb /mnt
>> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>>
>> In current implementation, sg_set_buf is called to add buffers to sg
>> list which
>> is put into the virtqueue eventually.
>
>
> The next sentence is somehow broken:
>
>
>> But there are some HighMem pages in
>> table->sgl can not get virtual address by sg_virt.
>
>
> Maybe something like "But _if_ there are ... _you_ can not get ..."?
Thanks, I'll pay attention to that in the next post.
>
> Eike
>
--
------------------------------------------
Wang Sen
Addr: XUPT,Xi'an,Shaanxi,China
Email: kelvin.xupt@gmail.com
------------------------------------------
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 10:41 ` [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list Stefan Hajnoczi
@ 2012-07-25 11:48 ` Sen Wang
0 siblings, 0 replies; 34+ messages in thread
From: Sen Wang @ 2012-07-25 11:48 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Paolo Bonzini, Wang Sen, linux-scsi, JBottomley, mc, linux-kernel
2012/7/25 Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>:
> On Wed, Jul 25, 2012 at 10:44:14AM +0200, Paolo Bonzini wrote:
>> On 25/07/2012 10:29, Wang Sen wrote:
>> > When using the commands below to write some data to a virtio-scsi LUN of the
>> > QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
>> >
>> > # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
>> > # sudo mount /dev/sdb /mnt
>> > # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>> >
>> > In current implementation, sg_set_buf is called to add buffers to sg list which
>> > is put into the virtqueue eventually. But there are some HighMem pages in
>> > table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
>> > return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
>> > in QEMU because an invalid GPA is passed by virtqueue.
>>
>> Heh, I was compiling (almost) the same patch as we speak. :)
>>
>> I've never seen QEMU crash; the VM would more likely just fail to boot
>> with a panic. But it's the same bug anyway.
>
> It's not a segfault "crash", I think it hits an abort(3) in QEMU's
> virtio code when trying to map an invalid guest physical address.
How does the guest fail to boot? I have never seen this case.
>
> Stefan
>
--
------------------------------------------
Wang Sen
Addr: XUPT,Xi'an,Shaanxi,China
Email: kelvin.xupt@gmail.com
------------------------------------------
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 9:41 ` Paolo Bonzini
@ 2012-07-25 12:34 ` Boaz Harrosh
2012-07-25 12:49 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 12:34 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 07/25/2012 12:41 PM, Paolo Bonzini wrote:
> On 25/07/2012 11:22, Boaz Harrosh wrote:
>>>>>> for_each_sg(table->sgl, sg_elem, table->nents, i)
>>>>>> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>>>>>> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>>>>>> + sg_elem->offset);
>>>>
>>>> This can simply be
>>>>
>>>> sg[idx++] = *sg_elem;
>>>>
>>>> Can you repost it with this change, and also add stable@vger.kernel.org
>>>> to the Cc? Thanks very much!
>>>>
>>
>> No! Please use sg_set_page()! Look at sg_set_page(), which calls sg_assign_page().
>> It has all the logic for jumping over chained arrays. When you start using long
>> sg_lists (which you should), jumping from chain to chain must go through
>> sg_page(sg_elem) and sg_assign_page(), as in the original patch.
>
> Hi Boaz,
>
> actually it seems to me that using sg_set_page is wrong, because it will
> not copy the end marker from table->sgl to sg[]. If something chained
> the sg[] scatterlist onto something else, sg_next's test for sg_is_last
> would go beyond the table->nents-th item and access invalid memory.
>
Yes, you did not understand this structure. And yes, I am right: when
using chained lists you *must* use sg_set_page().
You see, the chaining belongs to the allocation, not to the value of the
sg-elements. One must not copy the chaining marker to the destination
array, which might have its own. And one must not crap all over the
destination chaining markers, set at allocation time.
The sizes and mostly the pointers of source and destination are not
the same.
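(The markers live in the low two bits of page_link; from include/linux/scatterlist.h:

	#define sg_is_chain(sg)		((sg)->page_link & 0x01)
	#define sg_is_last(sg)		((sg)->page_link & 0x02)
	#define sg_chain_ptr(sg)	((struct scatterlist *) ((sg)->page_link & ~0x03))

A blind struct copy drags the source's bits along, while sg_set_page() keeps
the destination's own bits.)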
> Using chained sglists is on my to-do list, I expect that it would make a
> nice performance improvement. However, I was a bit confused as to
> what's the plan there; there is hardly any user, and many arches still
> do not define ARCH_HAS_SG_CHAIN. Do you have any pointer to discussions
> or LWN articles?
>
Only the source code, I'm afraid.
In SCSI land most LLDs should support chaining just by virtue of using the
for_each_sg macro. That's all it takes. Your code above does support it
(in Wang's version).
Though more code probably needs to be added at sg allocation time to actually
allocate and prepare a chain.
> I would need to add support for the long sglists to virtio; this is not
> a problem, but in the past Rusty complained that long sg-lists are not
> well suited to virtio (which would like to add elements not just at the
> beginning of a given sglist, but also at the end).
Well that can be done as well (if done carefully). It should be easy to add
chained fragments to the end of a given chain just as at the beginning.
It only means that the last element of the appended-to chain moves to
the next fragment and its place is replaced by a link.
If you have two ready-made long segments A and C which you would like not
to reallocate and copy, you insert a two-element segment in the middle, say
call it B.
The first element of B holds what used to be the last element of A, whose slot
is now used as the pointer to B, and the 2nd element of B is a pointer to C.
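Something along these lines (untested sketch; A, a_nents and C are the
ready-made segments from above):

	struct scatterlist B[2];

	sg_init_table(B, 2);
	/* copy A's last payload entry into B[0]; B[0]'s marker bits stay clean */
	sg_set_page(&B[0], sg_page(&A[a_nents - 1]),
		    A[a_nents - 1].length, A[a_nents - 1].offset);
	sg_chain(B, 2, C);		/* B[1] becomes a link to C */
	sg_chain(A, a_nents, B);	/* A's last slot is replaced by a link to B */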
> It seems to me that
> virtio would prefer to work with a struct scatterlist ** rather than a
> long sglist.
>
That's just going backwards, and lazy. As you said if you want to enjoy
the better performance cake you better break some eggs ;-)
> Paolo
Cheers
Boaz
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 11:44 ` Sen Wang
@ 2012-07-25 12:40 ` Boaz Harrosh
2012-07-27 3:12 ` Wang Sen
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 12:40 UTC (permalink / raw)
To: Sen Wang
Cc: Paolo Bonzini, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel
On 07/25/2012 02:44 PM, Sen Wang wrote:
> 2012/7/25 Paolo Bonzini <pbonzini@redhat.com>:
>> On 25/07/2012 10:29, Wang Sen wrote:
>>> When using the commands below to write some data to a virtio-scsi LUN of the
>>> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
>>>
>>> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
>>> # sudo mount /dev/sdb /mnt
>>> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>>>
>>> In current implementation, sg_set_buf is called to add buffers to sg list which
>>> is put into the virtqueue eventually. But there are some HighMem pages in
>>> table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
>>> return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
>>> in QEMU because an invalid GPA is passed by virtqueue.
>>
>> Heh, I was compiling (almost) the same patch as we speak. :)
>
> Uh, what a coincidence! :)
>
>>
>> I've never seen QEMU crash; the VM would more likely just fail to boot
>> with a panic. But it's the same bug anyway.
>
> I never met this before. How this situation happens?
>
>>
>>> My solution is using sg_set_page instead of sg_set_buf.
>>>
>>> I have tested the patch on my workstation. QEMU would not crash any more.
>>>
>>> Signed-off-by: Wang Sen <senwang@linux.vnet.ibm.com>
>>> ---
>>> drivers/scsi/virtio_scsi.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>>> index 1b38431..fc5c88a 100644
>>> --- a/drivers/scsi/virtio_scsi.c
>>> +++ b/drivers/scsi/virtio_scsi.c
>>> @@ -198,7 +198,8 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
>>> int i;
>>>
>>> for_each_sg(table->sgl, sg_elem, table->nents, i)
>>> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>>> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>>> + sg_elem->offset);
>>
>> This can simply be
>>
>> sg[idx++] = *sg_elem;
>>
>
> Yes, I saw your another E-mail. I think you're right. Simply calling
> sg_set_page can not handle
> the flag bits correctly. So, I'll repost the patch soon. Thank you!
>
No, this code is correct, though you will need to make sure to properly
terminate the destination sg_list.
But since the old code was using sg_set_buf(), it means the list was properly
terminated before, and therefore this code is good as is; please don't
touch it.
Thanks
Boaz
>> Can you repost it with this change, and also add stable@vger.kernel.org
>> to the Cc? Thanks very much!
>>
>> Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 12:34 ` Boaz Harrosh
@ 2012-07-25 12:49 ` Paolo Bonzini
2012-07-25 13:26 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 12:49 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 25/07/2012 14:34, Boaz Harrosh wrote:
>>>>>>> for_each_sg(table->sgl, sg_elem, table->nents, i)
>>>>>>> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>>>>>>> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>>>>>>> + sg_elem->offset);
>>>>>
>>>>> This can simply be
>>>>>
>>>>> sg[idx++] = *sg_elem;
>>>>>
>>>
>>> No! Please use sg_set_page()! Look at sg_set_page(), which calls sg_assign_page().
>>> It has all the logic for jumping over chained arrays. When you start using long
>>> sg_lists (which you should), jumping from chain to chain must go through
>>> sg_page(sg_elem) and sg_assign_page(), as in the original patch.
>>
>> actually it seems to me that using sg_set_page is wrong, because it will
>> not copy the end marker from table->sgl to sg[]. If something chained
>> the sg[] scatterlist onto something else, sg_next's test for sg_is_last
>> would go beyond the table->nents-th item and access invalid memory.
>
>
> Yes, you did not understand this structure. And yes, I am right: when
> using chained lists you *must* use sg_set_page().
>
> You see, the chaining belongs to the allocation, not to the value of the
> sg-elements. One must not copy the chaining marker to the destination
> array, which might have its own.
Except here the destination array has to be given to virtio, which
doesn't (yet) understand chaining. I'm using for_each_sg rather than a
simple memcpy exactly because I want to flatten the input scatterlist
onto consecutive scatterlist entries, which is what virtio expects (and
what I'll change when I get to it).
for_each_sg guarantees that I get non-chain scatterlists only, so it is
okay to value-assign them to sg[].
(Replying to your other message,
> No, this code is correct, though you will need to make sure to properly
> terminate the destination sg_list.
>
> But since the old code was using sg_set_buf(), it means the list was properly
> terminated before, and therefore this code is good as is; please don't
> touch it.
It was _not_ properly terminated, and it didn't matter because virtio
doesn't care about termination. Changing all the virtio devices to
properly terminate chains (and to use for_each_sg) is a prerequisite for
properly supporting long sglists.
> In SCSI land most LLDs should support chaining just by virtue of using the
> for_each_sg macro. That's all it takes. Your code above does support it.
Yes, it supports it but still has to undo them before passing to virtio.
What my LLD does is add a request descriptor in front of the scatterlist
that the LLD receives. I would like to do this with a 2-item
scatterlist: one for the request descriptor, and one which is a chain to
the original scatterlist. Except that if I call sg_chain and my
architecture does not define ARCH_HAS_SG_CHAIN, it will BUG out. So I
need to keep all the scatterlist allocation and copying crap that I have
now within #ifdef, and it will bitrot.
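(sg_chain() is, trimmed from include/linux/scatterlist.h:

	static inline void sg_chain(struct scatterlist *prv, unsigned int prv_nents,
				    struct scatterlist *sgl)
	{
	#ifndef ARCH_HAS_SG_CHAIN
		BUG();
	#endif
		/* offset and length are unused for a chain entry */
		prv[prv_nents - 1].offset = 0;
		prv[prv_nents - 1].length = 0;

		/* set the link bit, clear a stray termination bit */
		prv[prv_nents - 1].page_link = ((unsigned long) sgl | 0x01) & ~0x02;
	}

hence the BUG() on architectures without ARCH_HAS_SG_CHAIN.)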
>> I would need to add support for the long sglists to virtio; this is not
>> a problem, but in the past Rusty complained that long sg-lists are not
>> well suited to virtio (which would like to add elements not just at the
>> beginning of a given sglist, but also at the end).
>
>> Well that can be done as well (if done carefully). It should be easy to add
>> chained fragments to the end of a given chain just as at the beginning.
>> It only means that the last element of the appended-to chain moves to
>> the next fragment and its place is replaced by a link.
But you cannot do that in constant time, can you? And that means you do
not enjoy any benefit in terms of cache misses etc.
Also, this assumes that I can modify the appended-to chain. I'm not
sure I can do this?
>> It seems to me that virtio would prefer to work with a struct
>> scatterlist ** rather than a long sglist.
>
> That's just going backwards, and lazy. As you said if you want to enjoy
> the better performance cake you better break some eggs ;-)
:)
Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 12:49 ` Paolo Bonzini
@ 2012-07-25 13:26 ` Boaz Harrosh
2012-07-25 13:36 ` Paolo Bonzini
2012-07-25 14:17 ` virtio(-scsi) vs. chained sg_lists (was " Paolo Bonzini
0 siblings, 2 replies; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 13:26 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 07/25/2012 03:49 PM, Paolo Bonzini wrote:
>
> Except here the destination array has to be given to virtio, which
> doesn't (yet) understand chaining. I'm using for_each_sg rather than a
> simple memcpy exactly because I want to flatten the input scatterlist
> onto consecutive scatterlist entries, which is what virtio expects (and
> what I'll change when I get to it).
>
> for_each_sg guarantees that I get non-chain scatterlists only, so it is
> okay to value-assign them to sg[].
>
So if the virtio does not understand chaining at all then surely it will
not understand the 2-bit end marker and will get a wrong page pointer
with the 1st bit set.
As I said!! Since the previous code used sg_set_buf() and worked, you must
replace it with sg_set_page().
If there are *any* chaining-or-no-chaining marker errors, these should
be fixed as a separate patch!
Please let's concentrate on the problem at hand.
<snip>
>
> It was _not_ properly terminated, and it didn't matter because virtio
> doesn't care about termination. Changing all the virtio devices to
> properly terminate chains (and to use for_each_sg) is a prerequisite for
> properly supporting long sglists.
>
Fine then your code is now a crash because the terminating bit was just
copied over, which it was not before.
------------
Now back to how to support chaining:
Let's separate the two topics from now on. Send me one mail concerning
the proper patch above, and a different mail for how to support chaining.
>> In SCSI land most LLDs should support chaining just by virtue of using the
>> for_each_sg macro. That's all it takes. Your code above does support it.
>
> Yes, it supports it but still has to undo them before passing to virtio.
>
> What my LLD does is add a request descriptor in front of the scatterlist
> that the LLD receives. I would like to do this with a 2-item
> scatterlist: one for the request descriptor, and one which is a chain to
> the original scatterlist.
I hate that plan. Why override the scatter element yet again with a third
union, a "request descriptor"?
The reason it was overloaded as a link pointer in the first place was
historical compatibility, not good design.
You should have a proper "request descriptor" structure defined, pointing to
(or followed by) an sglist chain. Then all of the above is moot.
> Except that if I call sg_chain and my
> architecture does not define ARCH_HAS_SG_CHAIN, it will BUG out. So I
> need to keep all the scatterlist allocation and copying crap that I have
> now within #ifdef, and it will bitrot.
>
except that with the correct design you don't call sg_chain you just do:
request_descriptor->sg_list = sg;
>>> I would need to add support for the long sglists to virtio; this is not
>>> a problem, but in the past Rusty complained that long sg-lists are not
>>> well suited to virtio (which would like to add elements not just at the
>>> beginning of a given sglist, but also at the end).
>>
>> Well that can be done as well (if done carefully). It should be easy to add
>> chained fragments to the end of a given chain just as at the beginning.
>> It only means that the last element of the appended-to chain moves to
>> the next fragment and its place is replaced by a link.
>
> But you cannot do that in constant time, can you? And that means you do
> not enjoy any benefit in terms of cache misses etc.
>
I did not understand "constant time"; it is O(1), if that is what you meant.
(And surely today's code, which copies the full list, causes "cache misses".)
> Also, this assumes that I can modify the appended-to chain. I'm not
> sure I can do this?
>
Each case is its own. If the appended-to chain is const, yes, you'll have
to reallocate it and copy. Is that your case?
Cheers
Boaz
>>> It seems to me that virtio would prefer to work with a struct
>>> scatterlist ** rather than a long sglist.
>>
>> That's just going backwards, and lazy. As you said if you want to enjoy
>> the better performance cake you better break some eggs ;-)
>
> :)
>
> Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 13:26 ` Boaz Harrosh
@ 2012-07-25 13:36 ` Paolo Bonzini
2012-07-25 14:36 ` Boaz Harrosh
2012-07-25 14:17 ` virtio(-scsi) vs. chained sg_lists (was " Paolo Bonzini
1 sibling, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 13:36 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 25/07/2012 15:26, Boaz Harrosh wrote:
> On 07/25/2012 03:49 PM, Paolo Bonzini wrote:
>
>>
>> Except here the destination array has to be given to virtio, which
>> doesn't (yet) understand chaining. I'm using for_each_sg rather than a
>> simple memcpy exactly because I want to flatten the input scatterlist
>> onto consecutive scatterlist entries, which is what virtio expects (and
>> what I'll change when I get to it).
>>
>> for_each_sg guarantees that I get non-chain scatterlists only, so it is
>> okay to value-assign them to sg[].
>
> So if the virtio does not understand chaining at all then surely it will
> not understand the 2-bit end marker and will get a wrong page pointer
> with the 1st bit set.
It doesn't understand chaining, but it does use sg_phys(x) so it will
not get a wrong page pointer for the end marker.
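(sg_phys(sg) is essentially page_to_phys(sg_page(sg)) + sg->offset, and
sg_page() masks the marker bits off page_link:

	static inline struct page *sg_page(struct scatterlist *sg)
	{
		return (struct page *)((sg)->page_link & ~0x3);
	}

so a stray chain or termination bit does not corrupt the address virtio
computes.)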
> Fine then your code is now a crash because the terminating bit was just
> copied over, which it was not before.
I did test the patch with value-assignment.
> Let's separate the two topics from now on. Send me one mail concerning
> the proper patch above, and a different mail for how to support chaining.
Ok, and I'll change the topic.
Paolo
* virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 13:26 ` Boaz Harrosh
2012-07-25 13:36 ` Paolo Bonzini
@ 2012-07-25 14:17 ` Paolo Bonzini
2012-07-25 15:28 ` Boaz Harrosh
1 sibling, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 14:17 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
On 25/07/2012 15:26, Boaz Harrosh wrote:
>>> In SCSI land most LLDs should support chaining just by virtue of using the
>>> for_each_sg macro. That's all it takes. Your code above does support it.
>>
>> Yes, it supports it but still has to undo them before passing to virtio.
>>
>> What my LLD does is add a request descriptor in front of the scatterlist
>> that the LLD receives. I would like to do this with a 2-item
>> scatterlist: one for the request descriptor, and one which is a chain to
>> the original scatterlist.
>
> I hate that plan. Why override the scatter element yet again with a third
> union, a "request descriptor"?
I'm not overriding (or did you mean overloading?) anything, and I think
you're reading too much in my words.
What I am saying is (for a WRITE command):
1) what I get is a scsi_cmnd which contains an N-element scatterlist.
2) virtio-scsi has to build the "packet" that is passed to the hardware
(it does not matter that the hardware is virtual). This packet (per
virtio-scsi spec) has an N+1-element scatterlist, where the first
element is a request descriptor (struct virtio_scsi_cmd_req), and the
others describe the written data.
3) virtio takes care of converting the "packet" from a scatterlist
(which currently must be a flat one) to the hardware representation.
Here a walk is inevitable, so we don't care about this walk.
4) What I'm doing now: copying (and flattening) the N-element
scatterlist onto the last elements of an N+1 array that I pass to virtio.
_ _ _ _ _ _
|_|_|_|_|_|_| scsi_cmnd scatterlist
vvv COPY vvv
_ _ _ _ _ _ _
|_|_|_|_|_|_|_| scatterlist passed to virtio
|
virtio_scsi_cmd_req
Then I hand off the scatterlist to virtio. virtio walks it and converts
it to hardware format.
5) What I want to do: create a 2-element scatterlist, the first being
the request descriptor and the second chaining to scsi_cmnd's N-element
scatterlist.
_ _ _ _ _ _
|_|_|_|_|_|_| scsi_cmnd scatterlist
_ _/
|_|C| scatterlist passed to virtio
|
virtio_scsi_cmd_req
Then I hand off the scatterlist to virtio. virtio will still walk the
scatterlist chain, and convert it to N+1 elements for the hardware to
consume. Still, removing one walk largely reduces the length of my
critical sections. I also save some pointer-chasing because the
2-element scatterlist are short-lived and can reside on the stack.
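Concretely, the 2-element variant would be something like this (untested
sketch; cmd->req.cmd is the virtio_scsi_cmd_req embedded in the driver's
per-command structure, sc is the scsi_cmnd, and it of course needs
ARCH_HAS_SG_CHAIN):

	struct scatterlist req_sg[2];

	sg_init_table(req_sg, 2);
	sg_set_buf(&req_sg[0], &cmd->req.cmd, sizeof(cmd->req.cmd));
	sg_chain(req_sg, 2, scsi_sglist(sc));	/* hand the whole chain to virtio */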
Other details (you can probably skip these):
There is also a response descriptor. In the case of writes this is the
only element that the hardware will write to, so in the case of writes
the "written by hardware" scatterlist has 1 element only and does not
need chaining.
Reads are entirely symmetric. The hardware will read the request
descriptor from a 1-element scatterlist, and will write response+data
into an N+1-element scatterlist (the response descriptor precedes the
data that was read). It can be treated in exactly the same way. The
N+1-element scatterlist could also become a 2-element scatterlist with
chaining.
>> Except that if I call sg_chain and my
>> architecture does not define ARCH_HAS_SG_CHAIN, it will BUG out. So I
>> need to keep all the scatterlist allocation and copying crap that I have
>> now within #ifdef, and it will bitrot.
>
> except that with the correct design you don't call sg_chain you just do:
> request_descriptor->sg_list = sg;
By the above it should be clear, that request_descriptor is not a
driver-private extension of the scsi_cmnd. It is something passed to
the hardware.
>>> Well that can be done as well (if done carefully). It should be easy to add
>>> chained fragments to the end of a given chain just as at the beginning.
>>> It only means that the last element of the appended-to chain moves to
>>> the next fragment and its place is replaced by a link.
>>
>> But you cannot do that in constant time, can you? And that means you do
>> not enjoy any benefit in terms of cache misses etc.
>
> I did not understand "constant time"; it is O(1), if that is what you meant.
In the worst case it is a linked list, no? So in the worst case
_finding_ the last element of the appended-to chain is O(n).
Actually, appending to the end is not a problem for virtio-scsi. But
for example the virtio-blk spec places the response descriptor _after_
the input data. I think this was a mistake, and I didn't repeat it for
virtio-scsi, but I cited it because in the past Rusty complained that
the long sglist implementation was "SCSI-centric".
>> Also, this assumes that I can modify the appended-to chain. I'm not
>> sure I can do this?
>
> Each case is its own. If the appended-to chain is const, yes, you'll have
> to reallocate it and copy. Is that your case?
It will be virtio-blk's case, but we can ignore it.
Paolo
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 13:36 ` Paolo Bonzini
@ 2012-07-25 14:36 ` Boaz Harrosh
2012-07-25 15:09 ` performance improvements for the sglist API (Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list) Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 14:36 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 07/25/2012 04:36 PM, Paolo Bonzini wrote:
> On 25/07/2012 15:26, Boaz Harrosh wrote:
>> On 07/25/2012 03:49 PM, Paolo Bonzini wrote:
>>
>>>
>>> Except here the destination array has to be given to virtio, which
>>> doesn't (yet) understand chaining. I'm using for_each_sg rather than a
>>> simple memcpy exactly because I want to flatten the input scatterlist
>>> onto consecutive scatterlist entries, which is what virtio expects (and
>>> what I'll change when I get to it).
>>>
>>> for_each_sg guarantees that I get non-chain scatterlists only, so it is
>>> okay to value-assign them to sg[].
>>
>> So if the virtio does not understand chaining at all then surely it will
>> not understand the 2-bit end marker and will get a wrong page pointer
>> with the 1st bit set.
>
> It doesn't understand chaining, but it does use sg_phys(x) so it will
> not get a wrong page pointer for the end marker.
>
>> Fine then your code is now a crash because the terminating bit was just
>> copied over, which it was not before.
>
> I did test the patch with value-assignment.
>
Still you should use the sg_set_page()!!
1. It is not allowed to directly manipulate sg entries. One should always
use the proper accessor. Even if open coding does work and is not a bug
it should not be used anyway!
2. Future code that will support chaining will need to do as I say so why
change it then, again?
Please don't change two things in one patch. The fix is for HighMem pages;
please fix only that here. You can blasphemously open-code the sg manipulation
in a separate patch.
Please
Boaz
>> Let's separate the two topics from now on. Send me one mail concerning
>> the proper patch above, and a different mail for how to support chaining.
>
> Ok, and I'll change the topic.
>
> Paolo
* performance improvements for the sglist API (Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 14:36 ` Boaz Harrosh
@ 2012-07-25 15:09 ` Paolo Bonzini
2012-07-25 15:16 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 15:09 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 25/07/2012 16:36, Boaz Harrosh wrote:
>> >
>> > I did test the patch with value-assignment.
>> >
>
> Still you should use the sg_set_page()!!
> 1. It is not allowed to directly manipulate sg entries. One should always
> use the proper accessor. Even if open coding does work and is not a bug
> it should not be used anyway!
> 2. Future code that will support chaining will need to do as I say so why
> change it then, again?
Future code that will support chaining will not copy anything at all.
Also, and more important, note that I am _not_ calling sg_init_table
before the loop, only once in the driver initialization. That's because
memset in sg_init_table is an absolute performance killer, especially if
you have to do it in a critical section; and I'm not making this up, see
blk_rq_map_sg:
* If the driver previously mapped a shorter
* list, we could see a termination bit
* prematurely unless it fully inits the sg
* table on each mapping. We KNOW that there
* must be more entries here or the driver
* would be buggy, so force clear the
* termination bit to avoid doing a full
* sg_init_table() in drivers for each command.
*/
sg->page_link &= ~0x02;
sg = sg_next(sg);
So let's instead fix the API so that I (and blk-merge.c) can touch
memory just once. For example you could add __sg_set_page and
__sg_set_buf, basically the equivalent of
memset(sg, 0, sizeof(*sg));
sg_set_{page,buf}(sg, page, len, offset);
Calling these functions would be fine if you later add a manual call to
sg_mark_end, again the same as blk-merge.c does. See the attached
untested/uncompiled patch.
And value assignment would be the same as a
__sg_set_page(sg, sg_page(sg_elem), sg_elem->length, sg_elem->offset);
> Please don't change two things in one patch. The fix is for HighMem pages;
> please fix only that here. You can blasphemously open-code the sg manipulation
> in a separate patch.
The blasphemy is already there (the scatterlist that is handed to virtio
won't have the right end-of-chain marker). If anything,
value-assignment is trading a subtle blasphemy for a blatant one.
That's already an improvement, but let's just fix the API instead.
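(In virtscsi_map_sgl the copy loop would then become, roughly,

	for_each_sg(table->sgl, sg_elem, table->nents, i)
		__sg_set_page(&sg[idx++], sg_page(sg_elem),
			      sg_elem->length, sg_elem->offset);

with a manual sg_mark_end() only where a termination bit is actually wanted.)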
Paolo
* Re: performance improvements for the sglist API (Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 15:09 ` performance improvements for the sglist API (Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list) Paolo Bonzini
@ 2012-07-25 15:16 ` Paolo Bonzini
0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 15:16 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel
On 25/07/2012 17:09, Paolo Bonzini wrote:
> On 25/07/2012 16:36, Boaz Harrosh wrote:
>>>>
>>>> I did test the patch with value-assignment.
>>>>
>>
>> Still you should use the sg_set_page()!!
>> 1. It is not allowed to directly manipulate sg entries. One should always
>> use the proper accessor. Even if open coding does work and is not a bug
>> it should not be used anyway!
>> 2. Future code that will support chaining will need to do as I say so why
>> change it then, again?
>
> Future code that will support chaining will not copy anything at all.
>
> Also, and more important, note that I am _not_ calling sg_init_table
> before the loop, only once in the driver initialization. That's because
> memset in sg_init_table is an absolute performance killer, especially if
> you have to do it in a critical section; and I'm not making this up, see
> blk_rq_map_sg:
>
> * If the driver previously mapped a shorter
> * list, we could see a termination bit
> * prematurely unless it fully inits the sg
> * table on each mapping. We KNOW that there
> * must be more entries here or the driver
> * would be buggy, so force clear the
> * termination bit to avoid doing a full
> * sg_init_table() in drivers for each command.
> */
> sg->page_link &= ~0x02;
> sg = sg_next(sg);
>
> So let's instead fix the API so that I (and blk-merge.c) can touch
> memory just once. For example you could add __sg_set_page and
> __sg_set_buf, basically the equivalent of
>
> memset(sg, 0, sizeof(*sg));
> sg_set_{page,buf}(sg, page, len, offset);
>
> Calling these functions would be fine if you later add a manual call to
> sg_mark_end, again the same as blk-merge.c does. See the attached
> untested/uncompiled patch.
>
> And value assignment would be the same as a
>
> __sg_set_page(sg, sg_page(sg_elem), sg_elem->length, sg_elem->offset);
ENOPATCH...
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 160035f..00ba3d4 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -146,7 +146,9 @@ int blk_rq_map_sg(struct request_queue *q, struct request *rq,
new_segment:
if (!sg)
sg = sglist;
- else {
+ else
+ sg = sg_next(sg);
+
/*
* If the driver previously mapped a shorter
* list, we could see a termination bit
@@ -158,11 +160,7 @@ new_segment:
* termination bit to avoid doing a full
* sg_init_table() in drivers for each command.
*/
- sg->page_link &= ~0x02;
- sg = sg_next(sg);
- }
-
- sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
+ __sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
nsegs++;
}
bvprv = bvec;
@@ -182,12 +180,11 @@ new_segment:
if (rq->cmd_flags & REQ_WRITE)
memset(q->dma_drain_buffer, 0, q->dma_drain_size);
- sg->page_link &= ~0x02;
sg = sg_next(sg);
- sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
- q->dma_drain_size,
- ((unsigned long)q->dma_drain_buffer) &
- (PAGE_SIZE - 1));
+ __sg_set_page(sg, virt_to_page(q->dma_drain_buffer),
+ q->dma_drain_size,
+ ((unsigned long)q->dma_drain_buffer) &
+ (PAGE_SIZE - 1));
nsegs++;
rq->extra_len += q->dma_drain_size;
}
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index ac9586d..d6a937e 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -44,32 +44,23 @@ struct sg_table {
#define sg_chain_ptr(sg) \
((struct scatterlist *) ((sg)->page_link & ~0x03))
-/**
- * sg_assign_page - Assign a given page to an SG entry
- * @sg: SG entry
- * @page: The page
- *
- * Description:
- * Assign page to sg entry. Also see sg_set_page(), the most commonly used
- * variant.
- *
- **/
-static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
+static inline void __sg_set_page(struct scatterlist *sg, struct page *page,
+ unsigned int len, unsigned int offset)
{
- unsigned long page_link = sg->page_link & 0x3;
-
/*
* In order for the low bit stealing approach to work, pages
* must be aligned at a 32-bit boundary as a minimum.
*/
BUG_ON((unsigned long) page & 0x03);
#ifdef CONFIG_DEBUG_SG
- BUG_ON(sg->sg_magic != SG_MAGIC);
- BUG_ON(sg_is_chain(sg));
+ sg->sg_magic = SG_MAGIC;
#endif
- sg->page_link = page_link | (unsigned long) page;
+ sg->page_link = (unsigned long) page;
+ sg->offset = offset;
+ sg->length = len;
}
+
/**
* sg_set_page - Set sg entry to point at given page
* @sg: SG entry
@@ -87,9 +78,13 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
unsigned int len, unsigned int offset)
{
- sg_assign_page(sg, page);
- sg->offset = offset;
- sg->length = len;
+ unsigned long flags = sg->page_link & 0x3;
+#ifdef CONFIG_DEBUG_SG
+ BUG_ON(sg->sg_magic != SG_MAGIC);
+ BUG_ON(sg_is_chain(sg));
+#endif
+ __sg_set_page(sg, page, len, offset);
+ sg->page_link |= flags;
}
static inline struct page *sg_page(struct scatterlist *sg)
@@ -101,6 +96,12 @@ static inline struct page *sg_page(struct scatterlist *sg)
return (struct page *)((sg)->page_link & ~0x3);
}
+static inline void __sg_set_buf(struct scatterlist *sg, const void *buf,
+ unsigned int buflen)
+{
+ __sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
+}
+
/**
* sg_set_buf - Set sg entry to point at given data
* @sg: SG entry
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 14:17 ` virtio(-scsi) vs. chained sg_lists (was " Paolo Bonzini
@ 2012-07-25 15:28 ` Boaz Harrosh
2012-07-25 17:43 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 15:28 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
On 07/25/2012 05:17 PM, Paolo Bonzini wrote:
> On 25/07/2012 15:26, Boaz Harrosh wrote:
>>>> In SCSI land most LLDs should support chaining just by virtu of using the
>>>> for_each_sg macro. That all it takes. Your code above does support it.
>>>
>>> Yes, it supports it but still has to undo them before passing to virtio.
>>>
>>> What my LLD does is add a request descriptor in front of the scatterlist
>>> that the LLD receives. I would like to do this with a 2-item
>>> scatterlist: one for the request descriptor, and one which is a chain to
>>> the original scatterlist.
>>
>> I hate that plan. Why override the scatter element yet again with a third
>> union, a "request descriptor"?
>
> I'm not overriding (or did you mean overloading?) anything, and I think
> you're reading too much in my words.
>
> What I am saying is (for a WRITE command):
>
> 1) what I get is a scsi_cmnd which contains an N-element scatterlist.
>
> 2) virtio-scsi has to build the "packet" that is passed to the hardware
> (it does not matter that the hardware is virtual). This packet (per
> virtio-scsi spec) has an N+1-element scatterlist, where the first
> element is a request descriptor (struct virtio_scsi_cmd_req), and the
> others describe the written data.
>
Then "virtio-scsi spec" is crap. It overloads the meaning of
"struct scatterlist" of the first element in an array. to be a
"struct virtio_scsi_cmd_req".
Instead of simply defining the protocol as passing a "struct virtio_scsi_cmd_req".
The following scatterlist can then be pointed to or followed in the stream.
But the above is just bad programming, regardless of standard or not.
Since you need to change the standard to support chaining anyway,
it is a good time to fix this. (You need to change it because virtio
will now need to decode a guest pointer at each chain end into yet another
guest-memory buffer.)
In practice you will get the same stream: a first packet of a
struct virtio_scsi_cmd_req, followed by or pointing to an array of
struct scatterlist. Only you don't have that contraption where virtio_scsi_cmd_req
must be the same size as "struct scatterlist", and all that union crap.
What were you guys smoking that day? ;-)
> 3) virtio takes care of converting the "packet" from a scatterlist
> (which currently must be a flat one) to the hardware representation.
> Here a walk is inevitable, so we don't care about this walk.
>
"hardware representation" you mean aio or biovec, what ever the
IO submission path uses at the host end?
> 4) What I'm doing now: copying (and flattening) the N-element
> scatterlist onto the last elements of an N+1 array that I pass to virtio.
>
> _ _ _ _ _ _
> |_|_|_|_|_|_| scsi_cmnd scatterlist
>
> vvv COPY vvv
> _ _ _ _ _ _ _
> |_|_|_|_|_|_|_| scatterlist passed to virtio
> |
> virtio_scsi_cmd_req
>
> Then I hand off the scatterlist to virtio. virtio walks it and converts
> it to hardware format.
>
Crap design. All that extra full vector copy, just for that request
header, virtio_scsi_cmd_req.
Why are they related at all? Why does virtio_scsi_cmd_req need to be
part of the scatterlist, and of the same size as the first element?
> 5) What I want to do: create a 2-element scatterlist, the first being
> the request descriptor and the second chaining to scsi_cmnd's N-element
> scatterlist.
>
> _ _ _ _ _ _
> |_|_|_|_|_|_| scsi_cmnd scatterlist
> _ _/
> |_|C| scatterlist passed to virtio
> |
> virtio_scsi_cmd_req
>
> Then I hand off the scatterlist to virtio. virtio will still walk the
> scatterlist chain, and convert it to N+1 elements for the hardware to
> consume. Still, removing one walk largely reduces the length of my
> critical sections. I also save some pointer-chasing because the
> 2-element scatterlist are short-lived and can reside on the stack.
>
Sure, this is much better.
But since you are already changing the two code sites, here and at
virtio, to support chained scatterlists, why not fix it properly:
_ _ _ _ _ _
|_|_|_|_|_|_| scsi_cmnd scatterlist
_ _/_______________
|virtio_scsi_cmd_req| virtio_scsi_cmd_req passed to virtio
--------------------
Just a regularly defined header with an embedded pointer to a scatterlist.
And BTW the "scsi_cmnd scatterlist" can now be a chained one and virtio
can support that as well.
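A rough sketch of the kind of request being described here (hypothetical types,
not anything defined by the current virtio-scsi spec): a plain, fixed-size
header with an embedded pointer to the data scatterlist, which may itself be
chained.

struct virtio_scsi_cmd_req_hdr {                /* hypothetical */
        struct virtio_scsi_cmd_req req;         /* ordinary, fixed-size request header  */
        struct scatterlist *data_sgl;           /* the (possibly chained) data sglist   */
};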
>
> Other details (you can probably skip these):
>
> There is also a response descriptor. In the case of writes this is the
> only element that the hardware will write to, so in the case of writes
> the "written by hardware" scatterlist has 1 element only and does not
> need chaining.
>
Again the scatterlist-as-a-union-for-everything crap. Why, why??
That is not at all SCSI. In iSCSI a response packet is well defined.
> Reads are entirely symmetric. The hardware will read the request
> descriptor from a 1-element scatterlist,
Why a scatterlist? Why not just a well-defined READ-REQ struct?
> and will write response+data
> into an N+1-element scatterlist (the response descriptor precedes the
> data that was read). It can be treated in exactly the same way. The
> N+1-element scatterlist could also become a 2-element scatterlist with
> chaining.
>
Wahoo. Very bad design. This would not be accepted at T10.
And done by programmers? Talk about shooting oneself in the foot.
How do you deal with different ARCHs having different scatterlist
sizes? You need a different virtio driver at the host for every ARCH
you want to emulate. (Yes, I know, I know, only self-ARCH hosting, but why?)
>>> Except that if I call sg_chain and my
>>> architecture does not define ARCH_HAS_SG_CHAIN, it will BUG out. So I
>>> need to keep all the scatterlist allocation and copying crap that I have
>>> now within #ifdef, and it will bitrot.
>>
>> except that with the correct design you don't call sg_chain, you just do:
>> request_descriptor->sg_list = sg;
>
> By the above it should be clear that request_descriptor is not a
> driver-private extension of the scsi_cmnd. It is something passed to
> the hardware.
>
Someone wanted to be smart and embed the request_descriptor inside the
existing streams. Only it was embedded in the wrong place, at the scatterlist.
A scatterlist should be just that: a descriptor of the bigger buffer that
is acted on as a contiguous stream but is constructed of smaller atomic buffers
like PAGEs.
Also, in iSCSI we prepend an sg_element *pointing* to the iscsi-req-header,
followed by the immediate-data sglist, and send it via NET as a single skb.
But for you guys at virtio it is even simpler: since you contemplate
chaining, you don't even need the data as a single contiguous
stream, because you can decode a pointer and continue fetching from that guest-memory
pointer.
I don't see a single merit in the current design. And at the HW-bits level you are
doing exactly what I describe, only with very messy code and weirdly padded
structures.
Good luck
Boaz
>>>> Well, that can be done as well (if done carefully). It should be easy to add
>>>> chained fragments at the end of a given chain just as at the beginning.
>>>> It only means that the last element of the appended-to chain moves to
>>>> the next fragment and its place is replaced by a link.
>>>
>>> But you cannot do that in constant time, can you? And that means you do
>>> not enjoy any benefit in terms of cache misses etc.
>>
>> I did not understand "constant time"; it is O(0), if that's what you meant.
>
> In the worst case it is a linked list, no? So in the worst case
> _finding_ the last element of the appended-to chain is O(n).
>
> Actually, appending to the end is not a problem for virtio-scsi. But
> for example the virtio-blk spec places the response descriptor _after_
> the input data. I think this was a mistake, and I didn't repeat it for
> virtio-scsi, but I cited it because in the past Rusty complained that
> the long sglist implementation was "SCSI-centric".
>
>>> Also, this assumes that I can modify the appended-to chain. I'm not
>>> sure I can do this?
>>
>> Each case it's own. If the appended-to chain is const, yes you'll have
>> to reallocate it and copy. Is that your case?
>
> It will be virtio-blk's case, but we can ignore it.
>
> Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 15:28 ` Boaz Harrosh
@ 2012-07-25 17:43 ` Paolo Bonzini
2012-07-25 19:16 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 17:43 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
Il 25/07/2012 17:28, Boaz Harrosh ha scritto:
>> 1) what I get is a scsi_cmnd which contains an N-element scatterlist.
>>
>> 2) virtio-scsi has to build the "packet" that is passed to the hardware
>> (it does not matter that the hardware is virtual). This packet (per
>> virtio-scsi spec) has an N+1-element scatterlist, where the first
>> element is a request descriptor (struct virtio_scsi_cmd_req), and the
>> others describe the written data.
>
> Then the "virtio-scsi spec" is crap. It overloads the meaning of the first
> "struct scatterlist" element in an array to be a
> "struct virtio_scsi_cmd_req".
What the holy fuck? The first element simply _points_ to the "struct
virtio_scsi_cmd_req", just like subsequent elements point to the data.
And the protocol of the device is _not_ a struct scatterlist[]. The
virtio _API_ takes that array and converts to a series of physical
address + offset pairs.
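To make "points to" concrete, this is roughly what the driver does for the
first element today (a sketch based on the existing virtscsi_map_cmd code;
the data elements follow in the same array):

sg_set_buf(&sg[0], &cmd->req, sizeof(cmd->req)); /* element 0 describes the header buffer */
/* elements 1..N then describe the data buffers in exactly the same way */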
> Since you need to change the standard to support chaining then
> it is a good time to fix this.
Perhaps it is a good time for you to read the virtio spec. You are
making a huge confusion between the LLD->virtio interface and the
virtio->hardware interface. I'm talking only of the former.
>> 3) virtio takes care of converting the "packet" from a scatterlist
>> (which currently must be a flat one) to the hardware representation.
>> Here a walk is inevitable, so we don't care about this walk.
>
> "hardware representation" you mean aio or biovec, what ever the
> IO submission path uses at the host end?
No, I mean the way the virtio spec encodes the physical address + offset
pairs.
I stopped reading here.
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 17:43 ` Paolo Bonzini
@ 2012-07-25 19:16 ` Boaz Harrosh
2012-07-25 20:06 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 19:16 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
On 07/25/2012 08:43 PM, Paolo Bonzini wrote:
> Il 25/07/2012 17:28, Boaz Harrosh ha scritto:
>>> 1) what I get is a scsi_cmnd which contains an N-element scatterlist.
>>>
>>> 2) virtio-scsi has to build the "packet" that is passed to the hardware
>>> (it does not matter that the hardware is virtual). This packet (per
>>> virtio-scsi spec) has an N+1-element scatterlist, where the first
>>> element is a request descriptor (struct virtio_scsi_cmd_req), and the
>>> others describe the written data.
>>
>> Then the "virtio-scsi spec" is crap. It overloads the meaning of the first
>> "struct scatterlist" element in an array to be a
>> "struct virtio_scsi_cmd_req".
>
> What the holy fuck? The first element simply _points_ to the "struct
> virtio_scsi_cmd_req", just like subsequent elements point to the data.
>
> And the protocol of the device is _not_ a struct scatterlist[]. The
> virtio _API_ takes that array and converts to a series of physical
> address + offset pairs.
>
>> Since you need to change the standard to support chaining then
>> it is a good time to fix this.
>
> Perhaps it is a good time for you to read the virtio spec. You are
> making a huge confusion between the LLD->virtio interface and the
> virtio->hardware interface. I'm talking only of the former.
>
>>> 3) virtio takes care of converting the "packet" from a scatterlist
>>> (which currently must be a flat one) to the hardware representation.
>>> Here a walk is inevitable, so we don't care about this walk.
>>
>> "hardware representation" you mean aio or biovec, what ever the
>> IO submission path uses at the host end?
>
> No, I mean the way the virtio spec encodes the physical address + offset
> pairs.
>
> I stopped reading here.
>
The picture confused me. It looked like the first element is the virtio_scsi_cmd_req
itself, not an sglist element that points to the struct's buffer.
In that case then yes, your plan of making a two-element fragment that points to the
original scsi-sglist is perfect. All you have to do is that, and all you have to do
at virtio is use the for_each_sg macro and you are done.
You don't need any sglist allocation or reshaping. And you can easily support
chaining. Looks like an order of magnitude simpler than what you do now.
So what is the problem?
And BTW you won't need that new __sg_set_page API anymore.
> Paolo
Cheers
Boaz
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 19:16 ` Boaz Harrosh
@ 2012-07-25 20:06 ` Paolo Bonzini
2012-07-25 21:04 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-25 20:06 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
Il 25/07/2012 21:16, Boaz Harrosh ha scritto:
> The picture confused me. It looked like the first element is the virtio_scsi_cmd_req
> itself, not an sglist element that points to the struct's buffer.
>
> In that case then yes, your plan of making a two-element fragment that points to the
> original scsi-sglist is perfect. All you have to do is that, and all you have to do
> at virtio is use the for_each_sg macro and you are done.
>
> You don't need any sglist allocation or reshaping. And you can easily support
> chaining. Looks like an order of magnitude simpler than what you do now.
It is.
> So what is the problem?
That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
care about do). So I need to go through all architectures and make sure
they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
Kconfig define so that dependencies can be expressed properly.
> And BTW you won't need that new __sg_set_page API anymore.
Kind of.
sg_init_table(sg, 2);
sg_set_buf(sg[0], req, sizeof(req));
sg_chain(sg[1], scsi_out(sc));
is still a little bit worse than
__sg_set_buf(sg[0], req, sizeof(req));
__sg_chain(sg[1], scsi_out(sc));
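For context, a minimal sketch of how the data-out direction could be built on
the stack with the existing APIs, following the plan discussed above (error
handling and the data-in side omitted):

struct scatterlist sg_out[2];

sg_init_table(sg_out, 2);
/* element 0 points at the request header */
sg_set_buf(&sg_out[0], &cmd->req, sizeof(cmd->req));
/* element 1 becomes a chain entry to the scsi_cmnd's own data sglist */
sg_chain(sg_out, 2, scsi_out(sc)->table.sgl);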
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 20:06 ` Paolo Bonzini
@ 2012-07-25 21:04 ` Boaz Harrosh
2012-07-26 7:23 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-25 21:04 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
On 07/25/2012 11:06 PM, Paolo Bonzini wrote:
> Il 25/07/2012 21:16, Boaz Harrosh ha scritto:
>> The picture confused me. It looked like the first element is the virtio_scsi_cmd_req
>> itself, not an sglist element that points to the struct's buffer.
>>
>> In that case then yes, your plan of making a two-element fragment that points to the
>> original scsi-sglist is perfect. All you have to do is that, and all you have to do
>> at virtio is use the for_each_sg macro and you are done.
>>
>> You don't need any sglist allocation or reshaping. And you can easily support
>> chaining. Looks like an order of magnitude simpler than what you do now.
>
> It is.
>
>> So what is the problem?
>
> That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
> care about do). So I need to go through all architectures and make sure
> they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
> Kconfig define so that dependencies can be expressed properly.
>
What is actually preventing ARCH_HAS_SG_CHAIN on all these ARCHes?
Is it that the DMA drivers are not using for_each_sg()? Sounds easy
to fix.
But yes, a deep change would convert ARCH_HAS_SG_CHAIN to a Kconfig.
If you want to be lazy, like me, you might just put a BUILD_BUG_ON
in the code, requesting the user to disable the driver for this ARCH.
I bet there are more things to do at the ARCH level to enable virtualization
than just supporting ARCH_HAS_SG_CHAIN. Let it be just another requirement.
If you document it and make sure current ARCHs are fine, it should
never trigger.
>> And BTW you won't need that new __sg_set_page API anymore.
>
> Kind of.
>
> sg_init_table(sg, 2);
> sg_set_buf(sg[0], req, sizeof(req));
> sg_chain(sg[1], scsi_out(sc));
>
> is still a little bit worse than
>
> __sg_set_buf(sg[0], req, sizeof(req));
> __sg_chain(sg[1], scsi_out(sc));
>
I believe they are the same, especially for the
on-stack 2-element array. Actually I think
in both cases you need to at least call sg_init_table()
once after allocation, no?
Your old code, with the big array copy and re-shaping, was
a better example of the need for your new API, which I agree with.
But please, for my sake, do not call it __sg_chain. Call it
something like sg_chain_not_end(). I hate those "__", which,
for god's sake, mean what?
(A good name is when I don't have to read the code; an "__"
means "fuck you, go read the code".)
> Paolo
Thanks
Boaz
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-25 21:04 ` Boaz Harrosh
@ 2012-07-26 7:23 ` Paolo Bonzini
2012-07-26 7:56 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-26 7:23 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
Il 25/07/2012 23:04, Boaz Harrosh ha scritto:
>> That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
>> care about do). So I need to go through all architectures and make sure
>> they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
>> Kconfig define so that dependencies can be expressed properly.
>
> What is actually preventing ARCH_HAS_SG_CHAIN on all these ARCHes?
> Is it that the DMA drivers are not using for_each_sg()? Sounds easy
> to fix.
>
> But yes, a deep change would convert ARCH_HAS_SG_CHAIN to a Kconfig.
>
> If you want to be lazy, like me, you might just put a BUILD_BUG_ON
> in the code, requesting the user to disable the driver for this ARCH.
>
> I bet there are more things to do at the ARCH level to enable virtualization
> than just supporting ARCH_HAS_SG_CHAIN. Let it be just another requirement.
Actually, many arches run just fine with QEMU's binary code translation
(alpha, mips, coldfire). And since it's already pretty slow, using
virtio to improve performance is nice even in that scenario (which does
not require any kernel change). But it's just icing on the cake,
indeed. I can live with a BUILD_BUG_ON or, better, a "depends on
HAVE_SG_CHAIN" for the time being.
>>> And BTW you won't need that new __sg_set_page API anymore.
>>
>> Kind of.
>>
>> sg_init_table(sg, 2);
>> sg_set_buf(sg[0], req, sizeof(req));
>> sg_chain(sg[1], scsi_out(sc));
>>
>> is still a little bit worse than
>>
>> __sg_set_buf(sg[0], req, sizeof(req));
>> __sg_chain(sg[1], scsi_out(sc));
>
>
> I believe they are the same, especially for the
> on-stack 2-element array. Actually I think
> in both cases you need to at least call sg_init_table()
> once after allocation, no?
Because the scatterlist here would be allocated on the stack, I would
need to call it every time if I used sg_set_buf/sg_chain.
But sg_init_table is really just memset + set SG_MAGIC + sg_mark_end.
If you use the "full-init" APIs as above, you don't need any of those
(sg_chain undoes the flag changes in sg_mark_end and vice versa).
> But please, for my sake, do not call it __sg_chain. Call it
> something like sg_chain_not_end(). I hate those "__", which,
> for god's sake, mean what?
> (A good name is when I don't have to read the code; an "__"
> means "fuck you, go read the code".)
Right, better call them sg_init_buf/sg_init_page/sg_init_chain.
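A sketch of what such full-init helpers could look like (the names are the
hypothetical ones proposed in this thread, not existing kernel APIs); each one
initializes a single entry from scratch, so no prior sg_init_table() is needed:

static inline void sg_init_buf(struct scatterlist *sg,
                               const void *buf, unsigned int buflen)
{
        memset(sg, 0, sizeof(*sg));
#ifdef CONFIG_DEBUG_SG
        sg->sg_magic = SG_MAGIC;
#endif
        sg_set_buf(sg, buf, buflen);
}

static inline void sg_init_chain(struct scatterlist *sg,
                                 struct scatterlist *chain)
{
        memset(sg, 0, sizeof(*sg));
#ifdef CONFIG_DEBUG_SG
        sg->sg_magic = SG_MAGIC;
#endif
        sg_chain(sg, 1, chain);
}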
In the meanwhile, we still have a bug to fix, and we need to choose
between Sen Wang's v1 (sg_set_page) or v2 (value assignment). I'm still
leaning more towards v2, if only because I already tested that one myself.
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-26 7:23 ` Paolo Bonzini
@ 2012-07-26 7:56 ` Boaz Harrosh
2012-07-26 7:58 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-26 7:56 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
On 07/26/2012 10:23 AM, Paolo Bonzini wrote:
>
> In the meanwhile, we still have a bug to fix, and we need to choose
> between Sen Wang's v1 (sg_set_page) or v2 (value assignment). I'm still
> leaning more towards v2, if only because I already tested that one myself.
>
It's your call, you know what I think ;-)
I agree neither of them is a bug in the current code; they will both work
just the same.
> Paolo
Please CC me on the "convert to sg copy-less" patches, It looks interesting
Thanks
Boaz
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-26 7:56 ` Boaz Harrosh
@ 2012-07-26 7:58 ` Paolo Bonzini
2012-07-26 13:05 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-26 7:58 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
Il 26/07/2012 09:56, Boaz Harrosh ha scritto:
>> > In the meanwhile, we still have a bug to fix, and we need to choose
>> > between Sen Wang's v1 (sg_set_page) or v2 (value assignment). I'm still
>> > leaning more towards v2, if only because I already tested that one myself.
>
> It's your call, you know what I think ;-)
>
> I agree neither of them is a bug in the current code; they will both work
> just the same.
Cool. Then, Sen, please fix the commit message in v2 and repost.
> Please CC me on the "convert to sg copy-less" patches, It looks interesting
Sure.
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-26 7:58 ` Paolo Bonzini
@ 2012-07-26 13:05 ` Paolo Bonzini
2012-07-27 6:27 ` Rusty Russell
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-26 13:05 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org, Rusty Russell
[-- Attachment #1: Type: text/plain, Size: 1051 bytes --]
Il 26/07/2012 09:58, Paolo Bonzini ha scritto:
>
>> > Please CC me on the "convert to sg copy-less" patches, It looks interesting
> Sure.
Well, here is the gist of it (note it won't apply on any public tree,
hence no SoB yet). It should be split into multiple changesets, and you
can make more simplifications on top of it, because
virtio_scsi_target_state is no longer variable-sized, but that's
secondary.
The patch includes the conversion of virtio_ring.c to use sg_next.
It is a bit ugly because of backwards compatibility; that can be fixed
by going through all drivers, not that there are many.
I'm still a bit worried about the future of this feature, though. I would
be the first and (because of the non-portability) presumably sole user
for some time. Someone axing chained sg-lists in the future does not
seem too unlikely. So in practice I would rather just add to virtio a
function that takes two struct scatterlist ** (or an array of struct
scatterlist * + size pairs). Converting to sg_chain once it takes off
would be trivial.
Paolo
[-- Attachment #2: 0001-virtio-scsi-use-chained-sg_lists.patch --]
[-- Type: text/x-patch, Size: 14378 bytes --]
From 57f8d4a20cbe9b3b25e10cd0595d7ac102fc8f73 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Thu, 26 Jul 2012 09:58:14 +0200
Subject: [PATCH] virtio-scsi: use chained sg_lists
Using chained sg_lists simplifies everything a lot.
The scatterlists we pass to virtio are always of bounded size,
and can be allocated on the stack. This means we do not need to
take tgt_lock and struct virtio_scsi_target_state does not have
anymore a flexible array at the end, so we can avoid a pointer
access.
---
drivers/block/virtio_blk.c | 3 +
drivers/scsi/virtio_scsi.c | 93 ++++++++++++++++---------------------------
drivers/virtio/virtio_ring.c | 93 +++++++++++++++++++++++++++++++------------
include/linux/virtio.h | 8 +++
4 files changed, 114 insertions(+), 83 deletions(-)
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13f7ccb..ef65ea1 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -66,9 +66,6 @@ struct virtio_scsi_target_state {
struct virtio_scsi_vq *req_vq;
atomic_t reqs;
-
- /* For sglist construction when adding commands to the virtqueue. */
- struct scatterlist sg[];
};
/* Driver instance state */
@@ -362,20 +359,6 @@ static void virtscsi_event_done(struct virtqueue *vq)
spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
};
-static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
- struct scsi_data_buffer *sdb)
-{
- struct sg_table *table = &sdb->table;
- struct scatterlist *sg_elem;
- unsigned int idx = *p_idx;
- int i;
-
- for_each_sg(table->sgl, sg_elem, table->nents, i)
- sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
-
- *p_idx = idx;
-}
-
/**
* virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
* @vscsi : virtio_scsi state
@@ -384,52 +367,57 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
* @in_num : number of write-only elements
* @req_size : size of the request buffer
* @resp_size : size of the response buffer
- *
- * Called with tgt_lock held.
*/
-static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
- struct virtio_scsi_cmd *cmd,
- unsigned *out_num, unsigned *in_num,
+static void virtscsi_map_cmd(struct virtio_scsi_cmd *cmd,
+ struct scatterlist *sg_out,
+ unsigned *out_num,
+ struct scatterlist *sg_in,
+ unsigned *in_num,
size_t req_size, size_t resp_size)
{
struct scsi_cmnd *sc = cmd->sc;
- struct scatterlist *sg = tgt->sg;
- unsigned int idx = 0;
+
+ sg_init_table(sg_out, 2);
+ sg_init_table(sg_in, 2);
/* Request header. */
- sg_set_buf(&sg[idx++], &cmd->req, req_size);
+ sg_set_buf(&sg_out[0], &cmd->req, req_size);
+ *out_num = 1;
/* Data-out buffer. */
- if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
- virtscsi_map_sgl(sg, &idx, scsi_out(sc));
-
- *out_num = idx;
+ if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) {
+ struct sg_table *table = &scsi_out(sc)->table;
+ sg_chain(sg_out, 2, table->sgl);
+ *out_num += table->nents;
+ }
/* Response header. */
- sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
+ sg_set_buf(&sg_in[0], &cmd->resp, resp_size);
+ *in_num = 1;
/* Data-in buffer */
- if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
- virtscsi_map_sgl(sg, &idx, scsi_in(sc));
-
- *in_num = idx - *out_num;
+ if (sc && sc->sc_data_direction != DMA_TO_DEVICE) {
+ struct sg_table *table = &scsi_in(sc)->table;
+ sg_chain(sg_in, 2, table->sgl);
+ *in_num += table->nents;
+ }
}
-static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
- struct virtio_scsi_vq *vq,
+static int virtscsi_kick_cmd(struct virtio_scsi_vq *vq,
struct virtio_scsi_cmd *cmd,
size_t req_size, size_t resp_size, gfp_t gfp)
{
+ struct scatterlist sg_out[2], sg_in[2];
unsigned int out_num, in_num;
unsigned long flags;
int ret;
- spin_lock_irqsave(&tgt->tgt_lock, flags);
- virtscsi_map_cmd(tgt, cmd, &out_num, &in_num, req_size, resp_size);
+ virtscsi_map_cmd(cmd, sg_out, &out_num, sg_in, &in_num,
+ req_size, resp_size);
- spin_lock(&vq->vq_lock);
- ret = virtqueue_add_buf(vq->vq, tgt->sg, out_num, in_num, cmd, gfp);
- spin_unlock(&tgt->tgt_lock);
+ spin_lock_irqsave(&vq->vq_lock, flags);
+ ret = virtqueue_add_buf_sg(vq->vq, sg_out, out_num, sg_in, in_num,
+ cmd, gfp);
if (ret >= 0)
ret = virtqueue_kick_prepare(vq->vq);
@@ -447,9 +435,6 @@ static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
struct virtio_scsi_cmd *cmd;
int ret;
- struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
- BUG_ON(scsi_sg_count(sc) > shost->sg_tablesize);
-
/* TODO: check feature bit and fail if unsupported? */
BUG_ON(sc->sc_data_direction == DMA_BIDIRECTIONAL);
@@ -477,7 +462,7 @@ static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
- if (virtscsi_kick_cmd(tgt, tgt->req_vq, cmd,
+ if (virtscsi_kick_cmd(tgt->req_vq, cmd,
sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
GFP_ATOMIC) >= 0)
ret = 0;
@@ -519,11 +504,10 @@ static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
{
DECLARE_COMPLETION_ONSTACK(comp);
- struct virtio_scsi_target_state *tgt = vscsi->tgt[cmd->sc->device->id];
int ret = FAILED;
cmd->comp = &comp;
- if (virtscsi_kick_cmd(tgt, &vscsi->ctrl_vq, cmd,
+ if (virtscsi_kick_cmd(&vscsi->ctrl_vq, cmd,
sizeof cmd->req.tmf, sizeof cmd->resp.tmf,
GFP_NOIO) < 0)
goto out;
@@ -642,20 +626,16 @@ static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
}
static struct virtio_scsi_target_state *virtscsi_alloc_tgt(
- struct virtio_scsi *vscsi, u32 sg_elems)
+ struct virtio_scsi *vscsi)
{
struct virtio_scsi_target_state *tgt;
gfp_t gfp_mask = GFP_KERNEL;
- /* We need extra sg elements at head and tail. */
- tgt = kmalloc(sizeof(*tgt) + sizeof(tgt->sg[0]) * (sg_elems + 2),
- gfp_mask);
-
+ tgt = kmalloc(sizeof(*tgt), gfp_mask);
if (!tgt)
return NULL;
spin_lock_init(&tgt->tgt_lock);
- sg_init_table(tgt->sg, sg_elems + 2);
atomic_set(&tgt->reqs, 0);
/*
@@ -698,7 +678,7 @@ static int virtscsi_init(struct virtio_device *vdev,
struct virtio_scsi *vscsi, int num_targets)
{
int err;
- u32 i, sg_elems;
+ u32 i;
u32 num_vqs;
vq_callback_t **callbacks;
const char **names;
@@ -740,9 +720,6 @@ static int virtscsi_init(struct virtio_device *vdev,
if (virtio_has_feature(vdev, VIRTIO_SCSI_F_HOTPLUG))
virtscsi_kick_event_all(vscsi);
- /* We need to know how many segments before we allocate. */
- sg_elems = virtscsi_config_get(vdev, seg_max) ?: 1;
-
vscsi->tgt = kmalloc(num_targets *
sizeof(struct virtio_scsi_target_state *), GFP_KERNEL);
if (!vscsi->tgt) {
@@ -750,7 +727,7 @@ static int virtscsi_init(struct virtio_device *vdev,
goto out;
}
for (i = 0; i < num_targets; i++) {
- vscsi->tgt[i] = virtscsi_alloc_tgt(vscsi, sg_elems);
+ vscsi->tgt[i] = virtscsi_alloc_tgt(vscsi);
if (!vscsi->tgt[i]) {
err = -ENOMEM;
goto out;
--
1.7.1
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 693187d..d359e35 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -155,6 +155,9 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
+ /* Remove scatterlist terminator, we will tack more items soon. */
+ vblk->sg[num + out - 1].page_link &= ~0x2;
+
if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 9c5aeea..fda723c 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -126,12 +126,14 @@ struct vring_virtqueue
/* Set up an indirect table of descriptors and add it to the queue. */
static int vring_add_indirect(struct vring_virtqueue *vq,
- struct scatterlist sg[],
+ struct scatterlist *sg_out,
unsigned int out,
+ struct scatterlist *sg_in,
unsigned int in,
gfp_t gfp)
{
struct vring_desc *desc;
+ struct scatterlist *sg_last = NULL;
unsigned head;
int i;
@@ -139,20 +141,23 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
if (!desc)
return -ENOMEM;
- /* Transfer entries from the sg list into the indirect page */
+ /* Transfer entries from the sg_out list into the indirect page */
for (i = 0; i < out; i++) {
desc[i].flags = VRING_DESC_F_NEXT;
- desc[i].addr = sg_phys(sg);
- desc[i].len = sg->length;
+ desc[i].addr = sg_phys(sg_out);
+ desc[i].len = sg_out->length;
desc[i].next = i+1;
- sg++;
+ sg_last = sg_out;
+ sg_out = sg_next(sg_out);
}
+ if (!sg_in)
+ sg_in = sg_out ? sg_out : ++sg_last;
for (; i < (out + in); i++) {
desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
- desc[i].addr = sg_phys(sg);
- desc[i].len = sg->length;
+ desc[i].addr = sg_phys(sg_in);
+ desc[i].len = sg_in->length;
desc[i].next = i+1;
- sg++;
+ sg_in = sg_next(sg_in);
}
/* Last one doesn't continue. */
@@ -189,36 +194,44 @@ int virtqueue_get_queue_index(struct virtqueue *_vq)
EXPORT_SYMBOL_GPL(virtqueue_get_queue_index);
/**
- * virtqueue_add_buf - expose buffer to other end
+ * virtqueue_add_buf_sg - expose buffer to other end
* @vq: the struct virtqueue we're talking about.
- * @sg: the description of the buffer(s).
+ * @sg_out: the description of the output buffer(s).
* @out_num: the number of sg readable by other side
- * @in_num: the number of sg which are writable (after readable ones)
+ * @sg_in: the description of the input buffer(s).
+ * @in_num: the number of sg which are writable
* @data: the token identifying the buffer.
* @gfp: how to do memory allocations (if necessary).
*
* Caller must ensure we don't call this with other virtqueue operations
* at the same time (except where noted).
*
+ * If sg_in is NULL, the writable entries come in sg_out after the
+ * first out_num.
+ *
* Returns remaining capacity of queue or a negative error
* (ie. ENOSPC). Note that it only really makes sense to treat all
* positive return values as "available": indirect buffers mean that
* we can put an entire sg[] array inside a single queue entry.
*/
-int virtqueue_add_buf(struct virtqueue *_vq,
- struct scatterlist sg[],
- unsigned int out,
- unsigned int in,
- void *data,
- gfp_t gfp)
+int virtqueue_add_buf_sg(struct virtqueue *_vq,
+ struct scatterlist *sg_out,
+ unsigned int out,
+ struct scatterlist *sg_in,
+ unsigned int in,
+ void *data,
+ gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
+ struct scatterlist *sg_last = NULL;
unsigned int i, avail, uninitialized_var(prev);
int head;
START_USE(vq);
BUG_ON(data == NULL);
+ BUG_ON(!sg_out && !sg_in);
+ BUG_ON(out + in == 0);
#ifdef DEBUG
{
@@ -236,13 +249,12 @@ int virtqueue_add_buf(struct virtqueue *_vq,
/* If the host supports indirect descriptor tables, and we have multiple
* buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && (out + in) > 1 && vq->num_free) {
- head = vring_add_indirect(vq, sg, out, in, gfp);
+ head = vring_add_indirect(vq, sg_out, out, sg_in, in, gfp);
if (likely(head >= 0))
goto add_head;
}
BUG_ON(out + in > vq->vring.num);
- BUG_ON(out + in == 0);
if (vq->num_free < out + in) {
pr_debug("Can't add buf len %i - avail = %i\n",
@@ -262,17 +274,20 @@ int virtqueue_add_buf(struct virtqueue *_vq,
head = vq->free_head;
for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
- vq->vring.desc[i].addr = sg_phys(sg);
- vq->vring.desc[i].len = sg->length;
+ vq->vring.desc[i].addr = sg_phys(sg_out);
+ vq->vring.desc[i].len = sg_out->length;
prev = i;
- sg++;
+ sg_last = sg_out;
+ sg_out = sg_next(sg_out);
}
+ if (!sg_in)
+ sg_in = sg_out ? sg_out : ++sg_last;
for (; in; i = vq->vring.desc[i].next, in--) {
vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
- vq->vring.desc[i].addr = sg_phys(sg);
- vq->vring.desc[i].len = sg->length;
+ vq->vring.desc[i].addr = sg_phys(sg_in);
+ vq->vring.desc[i].len = sg_in->length;
prev = i;
- sg++;
+ sg_in = sg_next(sg_in);
}
/* Last one doesn't continue. */
vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
@@ -305,6 +320,34 @@ add_head:
return vq->num_free;
}
+EXPORT_SYMBOL_GPL(virtqueue_add_buf_sg);
+
+/**
+ * virtqueue_add_buf - expose buffer to other end
+ * @vq: the struct virtqueue we're talking about.
+ * @sg: the description of the buffer(s).
+ * @out_num: the number of sg readable by other side
+ * @in_num: the number of sg which are writable (after readable ones)
+ * @data: the token identifying the buffer.
+ * @gfp: how to do memory allocations (if necessary).
+ *
+ * Caller must ensure we don't call this with other virtqueue operations
+ * at the same time (except where noted).
+ *
+ * Returns remaining capacity of queue or a negative error
+ * (ie. ENOSPC). Note that it only really makes sense to treat all
+ * positive return values as "available": indirect buffers mean that
+ * we can put an entire sg[] array inside a single queue entry.
+ */
+int virtqueue_add_buf(struct virtqueue *_vq,
+ struct scatterlist sg[],
+ unsigned int out,
+ unsigned int in,
+ void *data,
+ gfp_t gfp)
+{
+ return virtqueue_add_buf_sg(_vq, sg, out, NULL, in, data, gfp);
+}
EXPORT_SYMBOL_GPL(virtqueue_add_buf);
/**
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 0ac3d3c..f0fe367 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -25,6 +25,14 @@ struct virtqueue {
void *priv;
};
+int virtqueue_add_buf_sg(struct virtqueue *vq,
+ struct scatterlist *sg_out,
+ unsigned int out_num,
+ struct scatterlist *sg_in,
+ unsigned int in_num,
+ void *data,
+ gfp_t gfp);
+
int virtqueue_add_buf(struct virtqueue *vq,
struct scatterlist sg[],
unsigned int out_num,
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-25 12:40 ` Boaz Harrosh
@ 2012-07-27 3:12 ` Wang Sen
2012-07-27 6:50 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Wang Sen @ 2012-07-27 3:12 UTC (permalink / raw)
To: Boaz Harrosh
Cc: Paolo Bonzini, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel
2012/7/25 Boaz Harrosh <bharrosh@panasas.com>:
> On 07/25/2012 02:44 PM, Sen Wang wrote:
>
>> 2012/7/25 Paolo Bonzini <pbonzini@redhat.com>:
>>> Il 25/07/2012 10:29, Wang Sen ha scritto:
>>>> When using the commands below to write some data to a virtio-scsi LUN of the
>>>> QEMU guest(32-bit) with 1G physical memory(qemu -m 1024), the qemu will crash.
>>>>
>>>> # sudo mkfs.ext4 /dev/sdb (/dev/sdb is the virtio-scsi LUN.)
>>>> # sudo mount /dev/sdb /mnt
>>>> # dd if=/dev/zero of=/mnt/file bs=1M count=1024
>>>>
>>>> In current implementation, sg_set_buf is called to add buffers to sg list which
>>>> is put into the virtqueue eventually. But there are some HighMem pages in
>>>> table->sgl can not get virtual address by sg_virt. So, sg_virt(sg_elem) may
>>>> return NULL value. This will cause QEMU exit when virtqueue_map_sg is called
>>>> in QEMU because an invalid GPA is passed by virtqueue.
>>>
>>> Heh, I was compiling (almost) the same patch as we speak. :)
>>
>> Uh, what a coincidence! :)
>>
>>>
>>> I've never seen QEMU crash; the VM would more likely just fail to boot
>>> with a panic. But it's the same bug anyway.
>>
>> I never ran into this before. How does this situation happen?
>>
>>>
>>>> My solution is using sg_set_page instead of sg_set_buf.
>>>>
>>>> I have tested the patch on my workstation. QEMU would not crash any more.
>>>>
>>>> Signed-off-by: Wang Sen <senwang@linux.vnet.ibm.com>
>>>> ---
>>>> drivers/scsi/virtio_scsi.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
>>>> index 1b38431..fc5c88a 100644
>>>> --- a/drivers/scsi/virtio_scsi.c
>>>> +++ b/drivers/scsi/virtio_scsi.c
>>>> @@ -198,7 +198,8 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
>>>> int i;
>>>>
>>>> for_each_sg(table->sgl, sg_elem, table->nents, i)
>>>> - sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
>>>> + sg_set_page(&sg[idx++], sg_page(sg_elem), sg_elem->length,
>>>> + sg_elem->offset);
>>>
>>> This can simply be
>>>
>>> sg[idx++] = *sg_elem;
>>>
>>
>> Yes, I saw your other e-mail. I think you're right. Simply calling
>> sg_set_page cannot handle
>> the flag bits correctly. So, I'll repost the patch soon. Thank you!
>>
>
>
> No this code is correct, though you will need to make sure to properly
> terminate the destination sg_list.
Yes, the termination marker in the destination list is set at initialization.
sg_set_page would not break this marker because it preserves both
marker bits in sg_assign_page.
Also, the allocation of the destination sg list accounts for the total number of
entries in the source sg_list. So, sg_set_page can work correctly.
The value assignment method can also work correctly, because the
termination marker in the source sg_list has been set in blk_rq_map_sg(), and
the last entry of the source sg_list is just copied to the last entry
in the destination list.
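For reference, sg_set_page preserves those markers because sg_assign_page()
keeps the two low bits of page_link; roughly, from <linux/scatterlist.h> of
this era (the BUG_ON sanity checks are omitted):

static inline void sg_assign_page(struct scatterlist *sg, struct page *page)
{
        unsigned long page_link = sg->page_link & 0x3;  /* chain and end bits */

        sg->page_link = page_link | (unsigned long) page;
}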
Uh, Paolo, Boaz, have you reached agreement on which method to use?
>
> But since the old code was using sg_set_buf(), that means it was properly
> terminated before, and therefore this code is good as is; please don't
> touch it.
>
> Thanks
> Boaz
>
>>> Can you repost it with this change, and also add stable@vger.kernel.org
>>> to the Cc? Thanks very much!
>>>
>>> Paolo
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>
--
------------------------------------------
Wang Sen
Addr: XUPT,Xi'an,Shaanxi,China
Email: kelvin.xupt@gmail.com
------------------------------------------
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-26 13:05 ` Paolo Bonzini
@ 2012-07-27 6:27 ` Rusty Russell
2012-07-27 8:11 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2012-07-27 6:27 UTC (permalink / raw)
To: Paolo Bonzini, Boaz Harrosh
Cc: Wang Sen, linux-scsi, JBottomley, stefanha, mc, linux-kernel,
kvm@vger.kernel.org
On Thu, 26 Jul 2012 15:05:39 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> Il 26/07/2012 09:58, Paolo Bonzini ha scritto:
> >
> >> > Please CC me on the "convert to sg copy-less" patches, It looks interesting
> > Sure.
>
> Well, here is the gist of it (note it won't apply on any public tree,
> hence no SoB yet). It should be split in multiple changesets and you
> can make more simplifications on top of it, because
> virtio_scsi_target_state is not anymore variable-sized, but that's
> secondary.
ISTR starting on such a patch years ago, but the primitives to
manipulate a chained sg_list were nonexistent, so I dropped it,
waiting for it to be fully-baked or replaced. That hasn't happened:
> + /* Remove scatterlist terminator, we will tack more items soon. */
> + vblk->sg[num + out - 1].page_link &= ~0x2;
I hate this interface:
> +int virtqueue_add_buf_sg(struct virtqueue *_vq,
> + struct scatterlist *sg_out,
> + unsigned int out,
> + struct scatterlist *sg_in,
> + unsigned int in,
> + void *data,
> + gfp_t gfp)
The point of chained scatterlists is they're self-terminated, so the
in & out counts should be calculated.
Counting them is not *that* bad, since we're about to read them all
anyway.
(Yes, the chained scatterlist stuff is complete crack, but I lost that
debate years ago.)
Here's my variant. Networking, console and block seem OK, at least
(ie. it booted!).
From: Rusty Russell <rusty@rustcorp.com.au>
Subject: virtio: use chained scatterlists.
Rather than handing a scatterlist[] and out and in numbers to
virtqueue_add_buf(), hand two separate ones which can be chained.
I shall refrain from ranting about what a disgusting hack chained
scatterlists are. I'll just note that this doesn't make things
simpler (see diff).
The scatterlists we use can be too large for the stack, so we put them
in our device struct and reuse them. But in many cases we don't want
to pay the cost of sg_init_table() as we don't know how many elements
we'll have and we'd have to initialize the entire table.
This means we have two choices: carefully reset the end markers after
we call virtqueue_add_buf(), which we do in virtio_net for the xmit
path where it's easy and we want to be optimal. Elsewhere we
implement a helper to unset the end markers after we've filled the
array.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
drivers/block/virtio_blk.c | 37 +++++++++++++-----
drivers/char/hw_random/virtio-rng.c | 2 -
drivers/char/virtio_console.c | 6 +--
drivers/net/virtio_net.c | 67 ++++++++++++++++++---------------
drivers/virtio/virtio_balloon.c | 6 +--
drivers/virtio/virtio_ring.c | 71 ++++++++++++++++++++++--------------
include/linux/virtio.h | 5 +-
net/9p/trans_virtio.c | 46 +++++++++++++++++++++--
8 files changed, 159 insertions(+), 81 deletions(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -100,6 +100,14 @@ static void blk_done(struct virtqueue *v
spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
}
+static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
+{
+ unsigned int i;
+
+ for (i = 0; i < num; i++)
+ sg[i].page_link &= ~0x02;
+}
+
static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
struct request *req)
{
@@ -140,6 +148,7 @@ static bool do_req(struct request_queue
}
}
+ /* We lay out our scatterlist in a single array, out then in. */
sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
/*
@@ -151,17 +160,8 @@ static bool do_req(struct request_queue
if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
+ /* This marks the end of the sg list at vblk->sg[out]. */
num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
-
- if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
- sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
- sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
- sizeof(vbr->in_hdr));
- }
-
- sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
- sizeof(vbr->status));
-
if (num) {
if (rq_data_dir(vbr->req) == WRITE) {
vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -172,7 +172,22 @@ static bool do_req(struct request_queue
}
}
- if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
+ if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
+ sg_set_buf(&vblk->sg[out + in++], vbr->req->sense,
+ SCSI_SENSE_BUFFERSIZE);
+ sg_set_buf(&vblk->sg[out + in++], &vbr->in_hdr,
+ sizeof(vbr->in_hdr));
+ }
+
+ sg_set_buf(&vblk->sg[out + in++], &vbr->status,
+ sizeof(vbr->status));
+
+ sg_unset_end_markers(vblk->sg, out+in);
+ sg_mark_end(&vblk->sg[out-1]);
+ sg_mark_end(&vblk->sg[out+in-1]);
+
+ if (virtqueue_add_buf(vblk->vq, vblk->sg, vblk->sg+out, vbr, GFP_ATOMIC)
+ < 0) {
mempool_free(vbr, vblk->pool);
return false;
}
diff --git a/drivers/char/hw_random/virtio-rng.c b/drivers/char/hw_random/virtio-rng.c
--- a/drivers/char/hw_random/virtio-rng.c
+++ b/drivers/char/hw_random/virtio-rng.c
@@ -47,7 +47,7 @@ static void register_buffer(u8 *buf, siz
sg_init_one(&sg, buf, size);
/* There should always be room for one buffer. */
- if (virtqueue_add_buf(vq, &sg, 0, 1, buf, GFP_KERNEL) < 0)
+ if (virtqueue_add_buf(vq, NULL, &sg, buf, GFP_KERNEL) < 0)
BUG();
virtqueue_kick(vq);
diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c
--- a/drivers/char/virtio_console.c
+++ b/drivers/char/virtio_console.c
@@ -392,7 +392,7 @@ static int add_inbuf(struct virtqueue *v
sg_init_one(sg, buf->buf, buf->size);
- ret = virtqueue_add_buf(vq, sg, 0, 1, buf, GFP_ATOMIC);
+ ret = virtqueue_add_buf(vq, NULL, sg, buf, GFP_ATOMIC);
virtqueue_kick(vq);
return ret;
}
@@ -457,7 +457,7 @@ static ssize_t __send_control_msg(struct
vq = portdev->c_ovq;
sg_init_one(sg, &cpkt, sizeof(cpkt));
- if (virtqueue_add_buf(vq, sg, 1, 0, &cpkt, GFP_ATOMIC) >= 0) {
+ if (virtqueue_add_buf(vq, sg, NULL, &cpkt, GFP_ATOMIC) >= 0) {
virtqueue_kick(vq);
while (!virtqueue_get_buf(vq, &len))
cpu_relax();
@@ -506,7 +506,7 @@ static ssize_t send_buf(struct port *por
reclaim_consumed_buffers(port);
sg_init_one(sg, in_buf, in_count);
- ret = virtqueue_add_buf(out_vq, sg, 1, 0, in_buf, GFP_ATOMIC);
+ ret = virtqueue_add_buf(out_vq, sg, NULL, in_buf, GFP_ATOMIC);
/* Tell Host to go! */
virtqueue_kick(out_vq);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -376,11 +376,11 @@ static int add_recvbuf_small(struct virt
skb_put(skb, MAX_PACKET_LEN);
hdr = skb_vnet_hdr(skb);
+ sg_init_table(vi->rx_sg, 2);
sg_set_buf(vi->rx_sg, &hdr->hdr, sizeof hdr->hdr);
-
skb_to_sgvec(skb, vi->rx_sg + 1, 0, skb->len);
- err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 2, skb, gfp);
+ err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, skb, gfp);
if (err < 0)
dev_kfree_skb(skb);
@@ -393,6 +393,8 @@ static int add_recvbuf_big(struct virtne
char *p;
int i, err, offset;
+ sg_init_table(vi->rx_sg, MAX_SKB_FRAGS + 1);
+
/* page in vi->rx_sg[MAX_SKB_FRAGS + 1] is list tail */
for (i = MAX_SKB_FRAGS + 1; i > 1; --i) {
first = get_a_page(vi, gfp);
@@ -425,8 +427,8 @@ static int add_recvbuf_big(struct virtne
/* chain first in list head */
first->private = (unsigned long)list;
- err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, MAX_SKB_FRAGS + 2,
- first, gfp);
+
+ err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, first, gfp);
if (err < 0)
give_pages(vi, first);
@@ -444,7 +446,7 @@ static int add_recvbuf_mergeable(struct
sg_init_one(vi->rx_sg, page_address(page), PAGE_SIZE);
- err = virtqueue_add_buf(vi->rvq, vi->rx_sg, 0, 1, page, gfp);
+ err = virtqueue_add_buf(vi->rvq, NULL, vi->rx_sg, page, gfp);
if (err < 0)
give_pages(vi, page);
@@ -581,6 +583,7 @@ static int xmit_skb(struct virtnet_info
{
struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
+ int ret;
pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
@@ -620,8 +623,16 @@ static int xmit_skb(struct virtnet_info
sg_set_buf(vi->tx_sg, &hdr->hdr, sizeof hdr->hdr);
hdr->num_sg = skb_to_sgvec(skb, vi->tx_sg + 1, 0, skb->len) + 1;
- return virtqueue_add_buf(vi->svq, vi->tx_sg, hdr->num_sg,
- 0, skb, GFP_ATOMIC);
+
+ ret = virtqueue_add_buf(vi->svq, vi->tx_sg, NULL, skb, GFP_ATOMIC);
+
+ /*
+ * An optimization: clear the end bit set by skb_to_sgvec, so
+ * we can simply re-use vi->tx_sg[] next time.
+ */
+ vi->tx_sg[hdr->num_sg-1].page_link &= ~0x02;
+
+ return ret;
}
static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -757,32 +768,31 @@ static int virtnet_open(struct net_devic
* never fail unless improperly formated.
*/
static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
- struct scatterlist *data, int out, int in)
+ struct scatterlist *cmdsg)
{
- struct scatterlist *s, sg[VIRTNET_SEND_COMMAND_SG_MAX + 2];
+ struct scatterlist in[1], out[2];
struct virtio_net_ctrl_hdr ctrl;
virtio_net_ctrl_ack status = ~0;
unsigned int tmp;
- int i;
/* Caller should know better */
- BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ) ||
- (out + in > VIRTNET_SEND_COMMAND_SG_MAX));
+ BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
- out++; /* Add header */
- in++; /* Add return status */
-
+ /* Prepend header to output. */
+ sg_init_table(out, 2);
ctrl.class = class;
ctrl.cmd = cmd;
+ sg_set_buf(&out[0], &ctrl, sizeof(ctrl));
+ if (cmdsg)
+ sg_chain(out, 2, cmdsg);
+ else
+ sg_mark_end(&out[0]);
- sg_init_table(sg, out + in);
+ /* Status response. */
+ sg_init_one(in, &status, sizeof(status));
- sg_set_buf(&sg[0], &ctrl, sizeof(ctrl));
- for_each_sg(data, s, out + in - 2, i)
- sg_set_buf(&sg[i + 1], sg_virt(s), s->length);
- sg_set_buf(&sg[out + in - 1], &status, sizeof(status));
- BUG_ON(virtqueue_add_buf(vi->cvq, sg, out, in, vi, GFP_ATOMIC) < 0);
+ BUG_ON(virtqueue_add_buf(vi->cvq, out, in, vi, GFP_ATOMIC) < 0);
virtqueue_kick(vi->cvq);
@@ -800,8 +810,7 @@ static void virtnet_ack_link_announce(st
{
rtnl_lock();
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_ANNOUNCE,
- VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL,
- 0, 0))
+ VIRTIO_NET_CTRL_ANNOUNCE_ACK, NULL))
dev_warn(&vi->dev->dev, "Failed to ack link announce.\n");
rtnl_unlock();
}
@@ -839,16 +848,14 @@ static void virtnet_set_rx_mode(struct n
sg_init_one(sg, &promisc, sizeof(promisc));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
- VIRTIO_NET_CTRL_RX_PROMISC,
- sg, 1, 0))
+ VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
promisc ? "en" : "dis");
sg_init_one(sg, &allmulti, sizeof(allmulti));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
- VIRTIO_NET_CTRL_RX_ALLMULTI,
- sg, 1, 0))
+ VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
allmulti ? "en" : "dis");
@@ -887,7 +894,7 @@ static void virtnet_set_rx_mode(struct n
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MAC,
VIRTIO_NET_CTRL_MAC_TABLE_SET,
- sg, 2, 0))
+ sg))
dev_warn(&dev->dev, "Failed to set MAC fitler table.\n");
kfree(buf);
@@ -901,7 +908,7 @@ static int virtnet_vlan_rx_add_vid(struc
sg_init_one(&sg, &vid, sizeof(vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
- VIRTIO_NET_CTRL_VLAN_ADD, &sg, 1, 0))
+ VIRTIO_NET_CTRL_VLAN_ADD, &sg))
dev_warn(&dev->dev, "Failed to add VLAN ID %d.\n", vid);
return 0;
}
@@ -914,7 +921,7 @@ static int virtnet_vlan_rx_kill_vid(stru
sg_init_one(&sg, &vid, sizeof(vid));
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_VLAN,
- VIRTIO_NET_CTRL_VLAN_DEL, &sg, 1, 0))
+ VIRTIO_NET_CTRL_VLAN_DEL, &sg))
dev_warn(&dev->dev, "Failed to kill VLAN ID %d.\n", vid);
return 0;
}
diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -102,7 +102,7 @@ static void tell_host(struct virtio_ball
sg_init_one(&sg, vb->pfns, sizeof(vb->pfns[0]) * vb->num_pfns);
/* We should always be able to add one buffer to an empty queue. */
- if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+ if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
BUG();
virtqueue_kick(vq);
@@ -246,7 +246,7 @@ static void stats_handle_request(struct
if (!virtqueue_get_buf(vq, &len))
return;
sg_init_one(&sg, vb->stats, sizeof(vb->stats));
- if (virtqueue_add_buf(vq, &sg, 1, 0, vb, GFP_KERNEL) < 0)
+ if (virtqueue_add_buf(vq, &sg, NULL, vb, GFP_KERNEL) < 0)
BUG();
virtqueue_kick(vq);
}
@@ -331,7 +331,7 @@ static int init_vqs(struct virtio_balloo
* use it to signal us later.
*/
sg_init_one(&sg, vb->stats, sizeof vb->stats);
- if (virtqueue_add_buf(vb->stats_vq, &sg, 1, 0, vb, GFP_KERNEL)
+ if (virtqueue_add_buf(vb->stats_vq, &sg, NULL, vb, GFP_KERNEL)
< 0)
BUG();
virtqueue_kick(vb->stats_vq);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -121,35 +121,41 @@ struct vring_virtqueue
#define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
+/* This doesn't have the counter that for_each_sg() has */
+#define foreach_sg(sglist, i) \
+ for (i = (sglist); i; i = sg_next(i))
+
/* Set up an indirect table of descriptors and add it to the queue. */
static int vring_add_indirect(struct vring_virtqueue *vq,
- struct scatterlist sg[],
- unsigned int out,
- unsigned int in,
+ unsigned int num,
+ const struct scatterlist *out,
+ const struct scatterlist *in,
gfp_t gfp)
{
+ const struct scatterlist *sg;
struct vring_desc *desc;
unsigned head;
int i;
- desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+ desc = kmalloc(num * sizeof(struct vring_desc), gfp);
if (!desc)
return -ENOMEM;
/* Transfer entries from the sg list into the indirect page */
- for (i = 0; i < out; i++) {
+ i = 0;
+ foreach_sg(out, sg) {
desc[i].flags = VRING_DESC_F_NEXT;
desc[i].addr = sg_phys(sg);
desc[i].len = sg->length;
desc[i].next = i+1;
- sg++;
+ i++;
}
- for (; i < (out + in); i++) {
+ foreach_sg(in, sg) {
desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
desc[i].addr = sg_phys(sg);
desc[i].len = sg->length;
desc[i].next = i+1;
- sg++;
+ i++;
}
/* Last one doesn't continue. */
@@ -171,12 +177,21 @@ static int vring_add_indirect(struct vri
return head;
}
+static unsigned int count_sg(const struct scatterlist *sg)
+{
+ unsigned int count = 0;
+ const struct scatterlist *i;
+
+ foreach_sg(sg, i)
+ count++;
+ return count;
+}
+
/**
* virtqueue_add_buf - expose buffer to other end
* @vq: the struct virtqueue we're talking about.
- * @sg: the description of the buffer(s).
- * @out_num: the number of sg readable by other side
- * @in_num: the number of sg which are writable (after readable ones)
+ * @out: the description of the output buffer(s).
+ * @in: the description of the input buffer(s).
* @data: the token identifying the buffer.
* @gfp: how to do memory allocations (if necessary).
*
@@ -189,20 +204,23 @@ static int vring_add_indirect(struct vri
* we can put an entire sg[] array inside a single queue entry.
*/
int virtqueue_add_buf(struct virtqueue *_vq,
- struct scatterlist sg[],
- unsigned int out,
- unsigned int in,
+ const struct scatterlist *out,
+ const struct scatterlist *in,
void *data,
gfp_t gfp)
{
struct vring_virtqueue *vq = to_vvq(_vq);
- unsigned int i, avail, uninitialized_var(prev);
+ unsigned int i, avail, uninitialized_var(prev), num;
+ const struct scatterlist *sg;
int head;
START_USE(vq);
BUG_ON(data == NULL);
+ num = count_sg(out) + count_sg(in);
+ BUG_ON(num == 0);
+
#ifdef DEBUG
{
ktime_t now = ktime_get();
@@ -218,18 +236,17 @@ int virtqueue_add_buf(struct virtqueue *
/* If the host supports indirect descriptor tables, and we have multiple
* buffers, then go indirect. FIXME: tune this threshold */
- if (vq->indirect && (out + in) > 1 && vq->num_free) {
- head = vring_add_indirect(vq, sg, out, in, gfp);
+ if (vq->indirect && num > 1 && vq->num_free) {
+ head = vring_add_indirect(vq, num, out, in, gfp);
if (likely(head >= 0))
goto add_head;
}
- BUG_ON(out + in > vq->vring.num);
- BUG_ON(out + in == 0);
+ BUG_ON(num > vq->vring.num);
- if (vq->num_free < out + in) {
+ if (vq->num_free < num) {
pr_debug("Can't add buf len %i - avail = %i\n",
- out + in, vq->num_free);
+ num, vq->num_free);
/* FIXME: for historical reasons, we force a notify here if
* there are outgoing parts to the buffer. Presumably the
* host should service the ring ASAP. */
@@ -240,22 +257,24 @@ int virtqueue_add_buf(struct virtqueue *
}
/* We're about to use some buffers from the free list. */
- vq->num_free -= out + in;
+ vq->num_free -= num;
head = vq->free_head;
- for (i = vq->free_head; out; i = vq->vring.desc[i].next, out--) {
+
+ i = vq->free_head;
+ foreach_sg(out, sg) {
vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
vq->vring.desc[i].addr = sg_phys(sg);
vq->vring.desc[i].len = sg->length;
prev = i;
- sg++;
+ i = vq->vring.desc[i].next;
}
- for (; in; i = vq->vring.desc[i].next, in--) {
+ foreach_sg(in, sg) {
vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
vq->vring.desc[i].addr = sg_phys(sg);
vq->vring.desc[i].len = sg->length;
prev = i;
- sg++;
+ i = vq->vring.desc[i].next;
}
/* Last one doesn't continue. */
vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -26,9 +26,8 @@ struct virtqueue {
};
int virtqueue_add_buf(struct virtqueue *vq,
- struct scatterlist sg[],
- unsigned int out_num,
- unsigned int in_num,
+ const struct scatterlist *out,
+ const struct scatterlist *in,
void *data,
gfp_t gfp);
diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -244,6 +244,14 @@ pack_sg_list_p(struct scatterlist *sg, i
return index - start;
}
+static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
+{
+ unsigned int i;
+
+ for (i = 0; i < num; i++)
+ sg[i].page_link &= ~0x02;
+}
+
/**
* p9_virtio_request - issue a request
* @client: client instance issuing the request
@@ -258,6 +266,7 @@ p9_virtio_request(struct p9_client *clie
int in, out;
unsigned long flags;
struct virtio_chan *chan = client->trans;
+ const struct scatterlist *outsg = NULL, *insg = NULL;
p9_debug(P9_DEBUG_TRANS, "9p debug: virtio request\n");
@@ -268,12 +277,21 @@ req_retry:
/* Handle out VirtIO ring buffers */
out = pack_sg_list(chan->sg, 0,
VIRTQUEUE_NUM, req->tc->sdata, req->tc->size);
+ if (out) {
+ sg_unset_end_markers(chan->sg, out-1);
+ sg_mark_end(&chan->sg[out-1]);
+ outsg = chan->sg;
+ }
in = pack_sg_list(chan->sg, out,
VIRTQUEUE_NUM, req->rc->sdata, req->rc->capacity);
+ if (in) {
+ sg_unset_end_markers(chan->sg+out, in-1);
+ sg_mark_end(&chan->sg[out+in-1]);
+ insg = chan->sg+out;
+ }
- err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
- GFP_ATOMIC);
+ err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
if (err < 0) {
if (err == -ENOSPC) {
chan->ring_bufs_avail = 0;
@@ -355,6 +377,7 @@ p9_virtio_zc_request(struct p9_client *c
int in_nr_pages = 0, out_nr_pages = 0;
struct page **in_pages = NULL, **out_pages = NULL;
struct virtio_chan *chan = client->trans;
+ struct scatterlist *insg = NULL, *outsg = NULL;
p9_debug(P9_DEBUG_TRANS, "virtio request\n");
@@ -402,6 +425,13 @@ req_retry_pinned:
if (out_pages)
out += pack_sg_list_p(chan->sg, out, VIRTQUEUE_NUM,
out_pages, out_nr_pages, uodata, outlen);
+
+ if (out) {
+ sg_unset_end_markers(chan->sg, out-1);
+ sg_mark_end(&chan->sg[out-1]);
+ outsg = chan->sg;
+ }
+
/*
* Take care of in data
* For example TREAD have 11.
@@ -414,9 +446,13 @@ req_retry_pinned:
if (in_pages)
in += pack_sg_list_p(chan->sg, out + in, VIRTQUEUE_NUM,
in_pages, in_nr_pages, uidata, inlen);
+ if (in) {
+ sg_unset_end_markers(chan->sg+out, in-1);
+ sg_mark_end(&chan->sg[out+in-1]);
+ insg = chan->sg + out;
+ }
- err = virtqueue_add_buf(chan->vq, chan->sg, out, in, req->tc,
- GFP_ATOMIC);
+ err = virtqueue_add_buf(chan->vq, outsg, insg, req->tc, GFP_ATOMIC);
if (err < 0) {
if (err == -ENOSPC) {
chan->ring_bufs_avail = 0;
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list
2012-07-27 3:12 ` Wang Sen
@ 2012-07-27 6:50 ` Paolo Bonzini
0 siblings, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-27 6:50 UTC (permalink / raw)
To: Wang Sen
Cc: Boaz Harrosh, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel
On 27/07/2012 05:12, Wang Sen wrote:
>> > No this code is correct, though you will need to make sure to properly
>> > terminate the destination sg_list.
> Yes, the termination marker in the destination list is set at initialization.
> sg_set_page would not break this marker because it preserves both marker
> bits in sg_assign_page().
>
> Also, the destination sg list is allocated based on the total number of
> entries in the source sg_list. So, sg_set_page can work correctly.
>
> The value-assignment method can also work correctly, because the
> termination marker in the source sg_list has been set in blk_rq_map_sg(),
> and the last entry of the source sg_list is just copied to the last entry
> in the destination list.
>
> Uh, Paolo, Boaz, have you reached agreement on which method to use?
Let's use the value assignment, and document it in the commit message as
follows:
Value assignment creates a well-formed scatterlist, because the
termination marker in the source sg_list has been set in blk_rq_map_sg().
The last entry of the source sg_list is just copied to the last
entry in the destination list. Note that, for now, virtio_ring does not
care about the form of the scatterlist and simply processes the first
out_num + in_num consecutive elements of the sg[] array.
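For illustration, the value-assignment loop would look roughly like this in
virtscsi_map_sgl (a sketch along the lines of the hunk quoted earlier in the
thread, not necessarily the final committed code):

static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
                             struct sg_table *table)
{
        struct scatterlist *sg_elem;
        unsigned int idx = *p_idx;
        int i;

        /* Copy each source entry by value.  The end marker that
         * blk_rq_map_sg() set on the last source entry travels with it;
         * virtio_ring ignores the markers and just walks the first
         * out_num + in_num consecutive elements of sg[]. */
        for_each_sg(table->sgl, sg_elem, table->nents, i)
                sg[idx++] = *sg_elem;

        *p_idx = idx;
}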
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-27 6:27 ` Rusty Russell
@ 2012-07-27 8:11 ` Paolo Bonzini
2012-07-29 23:50 ` Rusty Russell
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-27 8:11 UTC (permalink / raw)
To: Rusty Russell
Cc: Boaz Harrosh, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel, kvm@vger.kernel.org
On 27/07/2012 08:27, Rusty Russell wrote:
>> > +int virtqueue_add_buf_sg(struct virtqueue *_vq,
>> > + struct scatterlist *sg_out,
>> > + unsigned int out,
>> > + struct scatterlist *sg_in,
>> > + unsigned int in,
>> > + void *data,
>> > + gfp_t gfp)
> The point of chained scatterlists is they're self-terminated, so the
> in & out counts should be calculated.
>
> Counting them is not *that* bad, since we're about to read them all
> anyway.
>
> (Yes, the chained scatterlist stuff is complete crack, but I lost that
> debate years ago.)
>
> Here's my variant. Networking, console and block seem OK, at least
> (ie. it booted!).
I hate the for loops, even though we are indeed about to read all the
scatterlists anyway... all they do is lengthen critical sections. Also,
being the first user of chained scatterlist doesn't exactly give me warm
fuzzies.
I think it's simpler if we provide an API to add individual buffers to
the virtqueue, so that you can do multiple virtqueue_add_buf_more
(whatever) before kicking the virtqueue. The idea is that I can still
use indirect buffers for the scatterlists that come from the block layer
or from an skb, but I will use direct buffers for the request/response
descriptors. The direct buffers are always a small number (usually 2),
so you can balance the effect by making the virtqueue bigger. And for
small reads and writes, you save a kmalloc on a very hot path.
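For illustration only: the name virtqueue_add_buf_more comes from the
paragraph above, but the signature and the call sequence below are
hypothetical, not an existing kernel API.

/* Hypothetical incremental API: add each buffer separately, then kick once. */
int virtqueue_add_buf_more(struct virtqueue *vq,
                           struct scatterlist *sg, unsigned int nents,
                           bool device_writable, void *data, gfp_t gfp);

/* Sketch of how virtio-scsi could use it for one request: */
virtqueue_add_buf_more(vq, &hdr_sg, 1, false, cmd, GFP_ATOMIC);      /* request header */
virtqueue_add_buf_more(vq, data_sg, data_nents, is_read, cmd, GFP_ATOMIC); /* data from the block layer */
virtqueue_add_buf_more(vq, &resp_sg, 1, true, cmd, GFP_ATOMIC);      /* response header */
virtqueue_kick(vq);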
(BTW, scatterlists will have separate entries for each page; we do not
need this in virtio buffers. Collapsing physically-adjacent entries
will speed up QEMU and will also help avoid indirect buffers).
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-27 8:11 ` Paolo Bonzini
@ 2012-07-29 23:50 ` Rusty Russell
2012-07-30 7:12 ` Paolo Bonzini
0 siblings, 1 reply; 34+ messages in thread
From: Rusty Russell @ 2012-07-29 23:50 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Boaz Harrosh, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel, kvm@vger.kernel.org
On Fri, 27 Jul 2012 10:11:26 +0200, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 27/07/2012 08:27, Rusty Russell wrote:
> >> > +int virtqueue_add_buf_sg(struct virtqueue *_vq,
> >> > + struct scatterlist *sg_out,
> >> > + unsigned int out,
> >> > + struct scatterlist *sg_in,
> >> > + unsigned int in,
> >> > + void *data,
> >> > + gfp_t gfp)
> > The point of chained scatterlists is they're self-terminated, so the
> > in & out counts should be calculated.
> >
> > Counting them is not *that* bad, since we're about to read them all
> > anyway.
> >
> > (Yes, the chained scatterlist stuff is complete crack, but I lost that
> > debate years ago.)
> >
> > Here's my variant. Networking, console and block seem OK, at least
> > (ie. it booted!).
>
> I hate the for loops, even though we are indeed about to read all the
> scatterlists anyway... all they do is lengthen critical sections.
You're preaching to the choir: I agree. But even more, I hate the
passing of a number and a scatterlist: it makes it doubly ambiguous
whether we should use the number or the terminator. And I really hate
bad APIs, even more than a bit of icache loss.
> Also, being the first user of chained scatterlist doesn't exactly give
> me warm fuzzies.
We're far from the first user: they've been in the kernel for well over
7 years. They were introduced for the block layer, but they tended to
ignore private uses of scatterlists like this one.
> I think it's simpler if we provide an API to add individual buffers to
> the virtqueue, so that you can do multiple virtqueue_add_buf_more
> (whatever) before kicking the virtqueue. The idea is that I can still
> use indirect buffers for the scatterlists that come from the block layer
> or from an skb, but I will use direct buffers for the request/response
> descriptors. The direct buffers are always a small number (usually 2),
> so you can balance the effect by making the virtqueue bigger. And for
> small reads and writes, you save a kmalloc on a very hot path.
This is why I hate chained scatterlists: there's no sane way to tell the
difference between passing a simple array and passing a chain. We're
mugging the type system.
I think the right way of doing this is a flag. We could abuse gfp_t,
but that's super-ugly. Perhaps we should create our own
VQ_ATOMIC/VQ_KERNEL/VQ_DIRECT enum?
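Spelled out, such an enum might look like this (purely illustrative; nothing
like it exists in the tree):

/* Hypothetical replacement for the gfp_t argument of virtqueue_add_buf(): */
enum vq_alloc_hint {
        VQ_ATOMIC,      /* indirect table may be allocated with GFP_ATOMIC */
        VQ_KERNEL,      /* indirect table may be allocated with GFP_KERNEL */
        VQ_DIRECT,      /* never allocate; always use direct descriptors */
};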
Or we could do what we previously suggested, and try to do some
intelligent size heuristic. I've posted a patch from 3 years ago below.
> (BTW, scatterlists will have separate entries for each page; we do not
> need this in virtio buffers. Collapsing physically-adjacent entries
> will speed up QEMU and will also help avoid indirect buffers).
Yes, we should do this. But note that this means an iteration, so we
might as well combine the loops :)
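Combined, such a pass might look roughly like this (a hypothetical sketch
built on the descriptor-filling loop above; sgl and the surrounding
vq/i/prev variables are assumed from that loop, and the merging policy is
illustrative, not code from either patch):

struct scatterlist *sg = sgl;

while (sg) {
        phys_addr_t addr = sg_phys(sg);
        unsigned int len = sg->length;
        struct scatterlist *next = sg_next(sg);

        /* Collapse physically adjacent entries into one descriptor. */
        while (next && sg_phys(next) == addr + len) {
                len += next->length;
                next = sg_next(next);
        }

        vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
        vq->vring.desc[i].addr = addr;
        vq->vring.desc[i].len = len;
        prev = i;
        i = vq->vring.desc[i].next;
        sg = next;
}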
Cheers,
Rusty.
FIXME: remove printk
virtio: use indirect buffers based on demand (heuristic)
virtio_ring uses a ring buffer of descriptors: indirect support allows
a single descriptor to refer to a table of descriptors. This saves
space in the ring, but requires a kmalloc/kfree.
Rather than try to figure out the right threshold at which to use
indirect buffers, we drop the threshold dynamically when the ring is
under stress.
Note: to stress this, I reduced the ring size to 32 in lguest, and a
1G send reduced the threshold to 9.
Note2: I moved the BUG_ON()s above the indirect test, where they belong
(indirect falls thru on OOM, so the constraints still apply).
---
drivers/virtio/virtio_ring.c | 61 ++++++++++++++++++++++++++++++++++++-------
1 file changed, 52 insertions(+), 9 deletions(-)
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -63,6 +63,8 @@ struct vring_virtqueue
/* Host supports indirect buffers */
bool indirect;
+ /* Threshold before we go indirect. */
+ unsigned int indirect_threshold;
/* Number of free buffers */
unsigned int num_free;
@@ -139,6 +141,32 @@ static int vring_add_indirect(struct vri
return head;
}
+static void adjust_threshold(struct vring_virtqueue *vq,
+ unsigned int out, unsigned int in)
+{
+ /* There are really two species of virtqueue, and it matters here.
+ * If there are no output parts, it's a "normally full" receive queue,
+ * otherwise it's a "normally empty" send queue. */
+ if (out) {
+ /* Leave threshold unless we're full. */
+ if (out + in < vq->num_free)
+ return;
+ } else {
+ /* Leave threshold unless we're empty. */
+ if (vq->num_free != vq->vring.num)
+ return;
+ }
+
+ /* Never drop threshold below 1 */
+ vq->indirect_threshold /= 2;
+ vq->indirect_threshold |= 1;
+
+ printk("%s %s: indirect threshold %u (%u+%u vs %u)\n",
+ dev_name(&vq->vq.vdev->dev),
+ vq->vq.name, vq->indirect_threshold,
+ out, in, vq->num_free);
+}
+
static int vring_add_buf(struct virtqueue *_vq,
struct scatterlist sg[],
unsigned int out,
@@ -151,19 +179,33 @@ static int vring_add_buf(struct virtqueu
START_USE(vq);
BUG_ON(data == NULL);
-
- /* If the host supports indirect descriptor tables, and we have multiple
- * buffers, then go indirect. FIXME: tune this threshold */
- if (vq->indirect && (out + in) > 1 && vq->num_free) {
- head = vring_add_indirect(vq, sg, out, in);
- if (head != vq->vring.num)
- goto add_head;
- }
-
BUG_ON(out + in > vq->vring.num);
BUG_ON(out + in == 0);
vq->addbuf_total++;
+
+ /* If the host supports indirect descriptor tables, consider it. */
+ if (vq->indirect) {
+ bool try_indirect;
+
+ /* We tweak the threshold automatically. */
+ adjust_threshold(vq, out, in);
+
+ /* If we can't fit any at all, fall through. */
+ if (vq->num_free == 0)
+ try_indirect = false;
+ else if (out + in > vq->num_free)
+ try_indirect = true;
+ else
+ try_indirect = (out + in > vq->indirect_threshold);
+
+ if (try_indirect) {
+ head = vring_add_indirect(vq, sg, out, in);
+ if (head != vq->vring.num)
+ goto add_head;
+ }
+ }
+
if (vq->num_free < out + in) {
pr_debug("Can't add buf len %i - avail = %i\n",
out + in, vq->num_free);
@@ -401,6 +443,7 @@ struct virtqueue *vring_new_virtqueue(un
= vq->other_notify = 0;
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+ vq->indirect_threshold = num;
/* No callback? Tell other side not to bother us. */
if (!callback)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-29 23:50 ` Rusty Russell
@ 2012-07-30 7:12 ` Paolo Bonzini
2012-07-30 8:56 ` Boaz Harrosh
0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2012-07-30 7:12 UTC (permalink / raw)
To: Rusty Russell
Cc: Boaz Harrosh, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel, kvm@vger.kernel.org
On 30/07/2012 01:50, Rusty Russell wrote:
>> Also, being the first user of chained scatterlist doesn't exactly give
>> me warm fuzzies.
>
> We're far from the first user: they've been in the kernel for well over
> 7 years. They were introduced for the block layer, but they tended to
> ignore private uses of scatterlists like this one.
Yeah, but sg_chain has no users in drivers, only a private one in
lib/scatterlist.c. The internal API could be changed to something else
and leave virtio-scsi screwed...
> Yes, we should do this. But note that this means an iteration, so we
> might as well combine the loops :)
I'm really bad at posting pseudo-code, but you can count the number of
physically-contiguous entries at the beginning of the list only. So if
everything is contiguous, you use a single non-indirect buffer and save
a kmalloc. If you use indirect buffers, I suspect it's much less
effective to collapse physically-contiguous entries. More elaborate
heuristics do need a loop, though.
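A minimal sketch of that check (illustrative only; contig_prefix_len is not
an existing helper):

/* Count how many leading entries of the scatterlist are physically
 * contiguous.  If the count covers the whole list, a single direct
 * descriptor is enough and the indirect-table kmalloc can be skipped.
 * The caller guarantees at least one entry. */
static unsigned int contig_prefix_len(struct scatterlist *sg)
{
        unsigned int n = 1;
        phys_addr_t next = sg_phys(sg) + sg->length;

        while ((sg = sg_next(sg)) != NULL && sg_phys(sg) == next) {
                next = sg_phys(sg) + sg->length;
                n++;
        }
        return n;
}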
Paolo
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)
2012-07-30 7:12 ` Paolo Bonzini
@ 2012-07-30 8:56 ` Boaz Harrosh
0 siblings, 0 replies; 34+ messages in thread
From: Boaz Harrosh @ 2012-07-30 8:56 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Rusty Russell, Wang Sen, linux-scsi, JBottomley, stefanha, mc,
linux-kernel, kvm@vger.kernel.org
On 07/30/2012 10:12 AM, Paolo Bonzini wrote:
> On 30/07/2012 01:50, Rusty Russell wrote:
>>> Also, being the first user of chained scatterlist doesn't exactly give
>>> me warm fuzzies.
>>
>> We're far from the first user: they've been in the kernel for well over
>> 7 years. They were introduced for the block layer, but they tended to
>> ignore private uses of scatterlists like this one.
>
> Yeah, but sg_chain has no users in drivers, only a private one in
> lib/scatterlist.c. The internal API could be changed to something else
> and leave virtio-scsi screwed...
>
>> Yes, we should do this. But note that this means an iteration, so we
>> might as well combine the loops :)
>
> I'm really bad at posting pseudo-code, but you can count the number of
> physically-contiguous entries at the beginning of the list only. So if
> everything is contiguous, you use a single non-indirect buffer and save
> a kmalloc. If you use indirect buffers, I suspect it's much less
> effective to collapse physically-contiguous entries. More elaborate
> heuristics do need a loop, though.
>
[All the below with a grain of salt, from my senile memory]
You must not forget some facts about the scatterlist received here at the LLD.
It has already been DMA mapped and locked by the generic layer, which means
that the DMA engine has already collapsed physically-contiguous entries.
The entries you get here are already physically unique.
(There were bugs in the past where this was not true; please complain
if you find them again.)
A scatterlist is really two different lists occupying the same space, but with
two different lengths.
- One list is the page pointers plus offset and length; it is at least as
long as the 2nd list. The end marker corresponds to this list.
This list is the input to the DMA engine.
- The second list is the list of physical DMA addresses with their physical
lengths. No offset is needed because it is incorporated in the DMA address.
This list is the output from the DMA engine.
The reason the 2nd list is shorter is that the DMA engine tries to minimize
the number of physical scatter-list entries, which is usually a limited HW
resource. This list may follow chains, but its end is determined by the
sg_count returned by the DMA engine, not by the end marker.
At the time my opinion, and I think Rusty agreed, was that the scatterlist
should be split in two. The input page-ptr list is just the BIO, and the
output of the DMA engine should just be the physical part of the sg_list,
as a separate parameter. But all this was buried under too many APIs and
the noise was too strong for any single brave soul.
So I'd just blindly trust the sg_count returned by the DMA engine; it is
already optimized. I THINK.
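In a SCSI LLD the mapping is done by the midlayer, but the general pattern
with the DMA API looks like this (sketch; dev, sglist and nents stand for
whatever the driver already has):

struct scatterlist *sg;
int i, mapped;

mapped = dma_map_sg(dev, sglist, nents, DMA_TO_DEVICE);
if (!mapped)
        return -ENOMEM; /* mapping failed */

/* Walk only the 'mapped' entries the DMA layer produced, using the DMA
 * address/length accessors rather than the page-list end marker. */
for_each_sg(sglist, sg, mapped, i) {
        dma_addr_t addr = sg_dma_address(sg);
        unsigned int len = sg_dma_len(sg);

        /* program addr/len into the hardware descriptor here */
}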
> Paolo
Boaz
^ permalink raw reply [flat|nested] 34+ messages in thread