* [Lustre-devel] discontiguous kiov pages
@ 2011-06-02  2:33 wang
  2011-06-02 13:19 ` Eric Barton
  0 siblings, 1 reply; 8+ messages in thread
From: wang @ 2011-06-02  2:33 UTC (permalink / raw)
  To: lustre-devel

Our gnilnd is running into a hole in the kiov list in Lustre 2.1:

LustreError: 17837:0:(gnilnd_cb.c:594:kgnilnd_setup_phys_buffer()) Can't make payload
contiguous in I/O VM:page 17, offset 0, nob 6350, kiov_offset 0 kiov_len 2254
LustreError: 17837:0:(gnilnd_cb.c:1751:kgnilnd_send()) unable to setup buffer: -22

Is it now legal for an internal IOV (or KIOV) page to hold less than a full page?

It used to be that only the first and last pages in an IOV were allowed
to have offset + length < PAGE_SIZE.

The problem does not occur with a 1.8 client and a 2.1 server.
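
To illustrate the rule described above, here is a minimal sketch of a check for it. It is not code from gnilnd or LNet: `struct sketch_kiov`, `SKETCH_PAGE_SIZE` (assumed to be 4096) and `sketch_kiov_first_gap()` are hypothetical names, and the rule is read as "every entry but the first starts at offset 0, and every entry but the last runs to the end of its page".

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Hypothetical stand-in for a kiov entry (the page pointer is omitted). */
struct sketch_kiov {
    unsigned int kiov_len;     /* bytes used in this page */
    unsigned int kiov_offset;  /* starting offset within the page */
};

/*
 * Return the index of the first entry that breaks the "only the first and
 * last page may be partial" rule, or -1 if the whole list is contiguous.
 */
int sketch_kiov_first_gap(const struct sketch_kiov *kiov, size_t niov)
{
    for (size_t i = 0; i < niov; i++) {
        unsigned int end = kiov[i].kiov_offset + kiov[i].kiov_len;

        assert(end <= SKETCH_PAGE_SIZE);
        /* every entry except the first must start at the page boundary */
        if (i > 0 && kiov[i].kiov_offset != 0)
            return (int)i;
        /* every entry except the last must run to the end of its page */
        if (i + 1 < niov && end != SKETCH_PAGE_SIZE)
            return (int)i;
    }
    return -1;
}
```

With a full page, then an entry of kiov_offset 0 and kiov_len 2254 (the values in the log above), then another full page, the sketch reports the middle entry as the gap.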

Wally


* [Lustre-devel] discontiguous kiov pages
  2011-06-02  2:33 [Lustre-devel] discontiguous kiov pages wang
@ 2011-06-02 13:19 ` Eric Barton
  2011-06-07 23:57   ` Oleg Drokin
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Barton @ 2011-06-02 13:19 UTC (permalink / raw)
  To: lustre-devel

Wang,

Inline...

> -----Original Message-----
> From: lustre-devel-bounces at lists.lustre.org [mailto:lustre-devel-bounces at lists.lustre.org] On Behalf
> Of wang
> Sent: 02 June 2011 3:34 AM
> To: lustre-devel at lists.lustre.org
> Subject: [Lustre-devel] discontiguous kiov pages
> 
> Our gnilnd is running into a hole in kiov list in Lustre 2.1:
> 
> LustreError: 17837:0:(gnilnd_cb.c:594:kgnilnd_setup_phys_buffer()) Can't make payload
> contiguous in I/O VM:page 17, offset 0, nob 6350, kiov_offset 0 kiov_len 2254
> LustreError: 17837:0:(gnilnd_cb.c:1751:kgnilnd_send()) unable to setup buffer: -22
> 
> Is it now legal for an internal IOV (or KIOV) page to have less than a full page size ?
> 
> It used to be that only the first and last page in an IOV were allowed
> to be of a offset + length < PAGE_SIZE.

Quite correct.  LNDs have relied on this for years now.
A change like this should not have occurred without discussion
about the wider impact.

> It doesn't have this problem with 1.8 client and 2.1 server.
> 
> Wally
> 
> 
> 

* [Lustre-devel] discontiguous kiov pages
  2011-06-02 13:19 ` Eric Barton
@ 2011-06-07 23:57   ` Oleg Drokin
  2011-06-08 16:08     ` Nic Henke
  0 siblings, 1 reply; 8+ messages in thread
From: Oleg Drokin @ 2011-06-07 23:57 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Jun 2, 2011, at 9:19 AM, Eric Barton wrote:
>> Our gnilnd is running into a hole in kiov list in Lustre 2.1:
>> 
>> LustreError: 17837:0:(gnilnd_cb.c:594:kgnilnd_setup_phys_buffer()) Can't make payload
>> contiguous in I/O VM:page 17, offset 0, nob 6350, kiov_offset 0 kiov_len 2254
>> LustreError: 17837:0:(gnilnd_cb.c:1751:kgnilnd_send()) unable to setup buffer: -22
>> 
>> Is it now legal for an internal IOV (or KIOV) page to have less than a full page size ?
>> 
>> It used to be that only the first and last page in an IOV were allowed
>> to be of a offset + length < PAGE_SIZE.
> Quite correct.  LNDs have relied on this for years now.
> A change like this should not have occurred without discussion
> about the wider impact.

Actually now that we found what's happening, I think the issue is a bit less clear-cut.

What happens here is the client is submitting two niobufs that are not contiguous.
As such I see no reason why they need to be contiguous in VM too. Sure the 1.8 way of handling
this situation was to send separate RPCs, but I think even if two RDMA descriptors need to be
made, we still save plenty of overhead to justify this.

(basically we send three niobufs in this case: file pages 0-1, 40-47 (the 47th one is partial) and 49 (full) ).

Bye,
    Oleg


* [Lustre-devel] discontiguous kiov pages
  2011-06-07 23:57   ` Oleg Drokin
@ 2011-06-08 16:08     ` Nic Henke
  2011-06-08 20:09       ` Jinshan Xiong
  2011-06-09  3:16       ` Oleg Drokin
  0 siblings, 2 replies; 8+ messages in thread
From: Nic Henke @ 2011-06-08 16:08 UTC (permalink / raw)
  To: lustre-devel

On 06/07/2011 06:57 PM, Oleg Drokin wrote:
> Hello!
>

>>> It used to be that only the first and last page in an IOV were allowed
>>> to be of a offset + length < PAGE_SIZE.
>> Quite correct.  LNDs have relied on this for years now.
>> A change like this should not have occurred without discussion
>> about the wider impact.
>
> Actually now that we found what's happening, I think the issue is a bit less clear-cut.
>
> What happens here is the client is submitting two niobufs that are not contiguous.
> As such I see no reason why they need to be contiguous in VM too. Sure the 1.8 way of handling
> this situation was to send separate RPCs, but I think even if two RDMA descriptors need to be
> made, we still save plenty of overhead to justify this.
>
> (basically we send three niobufs in this case: file pages 0-1, 40-47 (the 47th one is partial) and 49 (full) ).

Oleg - it isn't clear to me what fix you are suggesting here. Are you 
saying LNet/LNDs should handle this situation (partial internal page) 
under the covers by setting up multiple RDMA on their own? This sounds 
like an LND API change, requiring a fix and validation for every LND. I 
*think* we might end up violating LNet layering here by having to adjust 
internal LNet structures from the LND to make sure the 2nd and 
subsequent RDMA landed at the correct spot in the MD, etc.

At least for our network, and I'd venture a guess for others, there is 
no way to handle the partial page other than multiple RDMA at the LND 
layer. When mapping these pages for RDMA, the internal hole can't be 
handled as we just map a set of physical pages for the HW to read 
from/write into with a single (address,length) vector. The internal hole 
would be ignored and would end up corrupting data as we overwrite the hole.
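
A rough sketch of why the hole matters with a single (address,length) mapping. This is an illustration only, not gnilnd code: `struct sketch_kiov`, `SKETCH_PAGE_SIZE` (assumed 4096) and `sketch_flat_misplacement()` are hypothetical, and the pages are assumed to be mapped back to back in I/O VM. The function compares where an entry's data belongs with where a flat byte stream would put it.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Hypothetical stand-in for a kiov entry (the page pointer is omitted). */
struct sketch_kiov {
    unsigned int kiov_len;     /* bytes used in this page */
    unsigned int kiov_offset;  /* starting offset within the page */
};

/*
 * Distance in bytes between where entry idx's data belongs and where a
 * single (address, length) transfer would place it. Zero means the flat
 * mapping is safe for that entry; non-zero means data lands in the hole.
 */
long sketch_flat_misplacement(const struct sketch_kiov *kiov, size_t niov,
                              size_t idx)
{
    assert(niov > 0 && idx < niov);

    /* intended position, measured from the start of the first page */
    long intended = (long)(idx * SKETCH_PAGE_SIZE) + (long)kiov[idx].kiov_offset;

    /* position in the flat stream: first offset plus all earlier lengths */
    long actual = (long)kiov[0].kiov_offset;
    for (size_t i = 0; i < idx; i++)
        actual += (long)kiov[i].kiov_len;

    return intended - actual;
}
```

For a full page, a 2254-byte page and another full page, the third entry ends up 1842 bytes too early under these assumptions, which is the overwrite of the hole described above.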

Cheers,
Nic


* [Lustre-devel] discontiguous kiov pages
  2011-06-08 16:08     ` Nic Henke
@ 2011-06-08 20:09       ` Jinshan Xiong
  2011-06-09  8:37         ` Peter Jones
  2011-06-09  3:16       ` Oleg Drokin
  1 sibling, 1 reply; 8+ messages in thread
From: Jinshan Xiong @ 2011-06-08 20:09 UTC (permalink / raw)
  To: lustre-devel


On Jun 8, 2011, at 9:08 AM, Nic Henke wrote:

> On 06/07/2011 06:57 PM, Oleg Drokin wrote:
>> Hello!
>> 
> 
>>>> It used to be that only the first and last page in an IOV were allowed
>>>> to be of a offset + length < PAGE_SIZE.
>>> Quite correct.  LNDs have relied on this for years now.
>>> A change like this should not have occurred without discussion
>>> about the wider impact.
>> 
>> Actually now that we found what's happening, I think the issue is a bit less clear-cut.
>> 
>> What happens here is the client is submitting two niobufs that are not contiguous.
>> As such I see no reason why they need to be contiguous in VM too. Sure the 1.8 way of handling
>> this situation was to send separate RPCs, but I think even if two RDMA descriptors need to be
>> made, we still save plenty of overhead to justify this.
>> 
>> (basically we send three niobufs in this case: file pages 0-1, 40-47 (the 47th one is partial) and 49 (full) ).
> 
> Oleg - it isn't clear to me what fix you are suggesting here. Are you 
> saying LNet/LNDs should handle this situation (partial internal page) 
> under the covers by setting up multiple RDMA on their own? This sounds 
> like an LND API change, requiring a fix and validation for every LND. I 
> *think* we might end up violating LNet layering here by having to adjust 
> internal LNet structures from the LND to make sure the 2nd and 
> subsequent RDMA landed at the correct spot in the MD, etc.

Please refer to LU-394 for a detailed description of this problem. For those who cannot access our JIRA system, I'm going to summarize it here.

The problem is as follows:
0. First, the app wrote a partial page A at the end of the file; it had enough grant on the client side, so page A was cached.
1. The app then sought forward and wrote another page B, which exceeded the quota limit, so it had to be written in sync mode (see vvp_io_commit_write).
2. In the current implementation of CLIO, for performance reasons, writing page B includes as many cached pages as possible to compose an RPC, and this includes page A.

So here comes the problem. The file size can only be extended once the write of page B succeeds; otherwise the file size would be wrong if the write of B fails. This causes ap_refresh_count() for page A to return an oap_count less than CFS_PAGE_SIZE. This is why the LND saw discontiguous pages.

Fixing this issue is easy: we write the sync page in a standalone RPC (not combined with cached pages). This is not a big cost, since it occurs only when quota runs out.

> 
> At least for our network, and I'd venture a guess for others, there is 
> no way to handle the partial page other than multiple RDMA at the LND 
> layer. When mapping these pages for RDMA, the internal hole can't be 
> handled as we just map a set of physical pages for the HW to read 
> from/write into with a single (address,length) vector. The internal hole 
> would be ignored and would end up corrupting data as we overwrite the hole.

Can't agree more. Having multiple RDMA descriptors would make the LND more complex.

What if we transferred the whole page anyway? This is okay because the page offset and length will tell the server which part of the data is really useful. It will waste some bandwidth, but it's far better than issuing one more RPC.
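
A minimal sketch of that idea, under stated assumptions: `struct sketch_kiov`, `SKETCH_PAGE_SIZE` (assumed 4096) and `sketch_pad_internal_pages()` are hypothetical names, not osc or LNet code. The valid ranges are left untouched (the server would still rely on them), and a separate "wire" list is widened so that no internal page is partial.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Hypothetical stand-in for a kiov entry (the page pointer is omitted). */
struct sketch_kiov {
    unsigned int kiov_len;     /* bytes used in this page */
    unsigned int kiov_offset;  /* starting offset within the page */
};

/*
 * Fill wire[] with the ranges to transfer so that only the first and last
 * entries may be partial. Returns the number of extra bytes sent.
 */
size_t sketch_pad_internal_pages(const struct sketch_kiov *valid,
                                 struct sketch_kiov *wire, size_t niov)
{
    size_t extra = 0;

    for (size_t i = 0; i < niov; i++) {
        unsigned int start = valid[i].kiov_offset;
        unsigned int end = valid[i].kiov_offset + valid[i].kiov_len;

        assert(end <= SKETCH_PAGE_SIZE);
        if (i > 0)
            start = 0;                 /* pull the start back to the page boundary */
        if (i + 1 < niov)
            end = SKETCH_PAGE_SIZE;    /* push the end out to the page boundary */

        wire[i].kiov_offset = start;
        wire[i].kiov_len = end - start;
        extra += wire[i].kiov_len - valid[i].kiov_len;
    }
    return extra;
}
```

For an internal page with 2254 valid bytes, the cost under these assumptions is 1842 padding bytes on the wire.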

Thanks,
Jinshan

> 
> Cheers,
> Nic

* [Lustre-devel] discontiguous kiov pages
  2011-06-08 16:08     ` Nic Henke
  2011-06-08 20:09       ` Jinshan Xiong
@ 2011-06-09  3:16       ` Oleg Drokin
  2011-06-09 14:38         ` Eric Barton
  1 sibling, 1 reply; 8+ messages in thread
From: Oleg Drokin @ 2011-06-09  3:16 UTC (permalink / raw)
  To: lustre-devel

Hello!

On Jun 8, 2011, at 12:08 PM, Nic Henke wrote:
> Oleg - it isn't clear to me what fix you are suggesting here. Are you 
> saying LNet/LNDs should handle this situation (partial internal page) 
> under the covers by setting up multiple RDMA on their own? This sounds 
> like an LND API change, requiring a fix and validation for every LND. I 
> *think* we might end up violating LNet layering here by having to adjust 
> internal LNet structures from the LND to make sure the 2nd and 
> subsequent RDMA landed at the correct spot in the MD, etc.

What about if we just transfer all such "partial" pages as full pages instead?
Should work for our and your purposes.
For Lustre purposes these two are almost like two different transfers anyway.
In this case we are totally not expecting you to "collapse" the hole.

Hm, I guess this is best handled on the osc level anyway, though. So
perhaps no changes from lnd side then.

Bye,
    Oleg
--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.


* [Lustre-devel] discontiguous kiov pages
  2011-06-08 20:09       ` Jinshan Xiong
@ 2011-06-09  8:37         ` Peter Jones
  0 siblings, 0 replies; 8+ messages in thread
From: Peter Jones @ 2011-06-09  8:37 UTC (permalink / raw)
  To: lustre-devel

Jinshan

While I am sure that many appreciate a high-level overview in the thread 
instead of needing to redirect to JIRA, everyone should be able to 
access JIRA if they did want to :-)

Peter
Whamcloud

On 11-06-08 1:09 PM, Jinshan Xiong wrote:
>
> <snip>

> For those who cannot access our jira system, I'm going to summarize it 
> here.
>


* [Lustre-devel] discontiguous kiov pages
  2011-06-09  3:16       ` Oleg Drokin
@ 2011-06-09 14:38         ` Eric Barton
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Barton @ 2011-06-09 14:38 UTC (permalink / raw)
  To: lustre-devel

It seems to me that Jay's suggestion to put the niobufs into
separate RPCs is a good one - particularly since writing the 2nd
niobuf should only be attempted after the first to ensure the
file size is set correctly (BTW this means the 2nd RPC cannot
be posted until the first has completed - otherwise the RPCs
could get re-ordered in the network or at the server).  

However it would be nice to aggregate small, possibly unrelated
I/Os and if/when we do that this issue will crop up again.  If
we stick with the rule that MDs cannot have internal partial pages,
we're forced to use 1 MD for each niobuf.  Putting several of these
in 1 RPC requires separate matchbits for each niobuf to ensure
correct match of source and sink buffers independent of races in
the network.  This must be more efficient than scheduling multiple 
concurrent RPCs each with 1 niobuf, but by how much isn't clear,
since the bulk transfer phases of both schemes should cause identical
network traffic.  
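
As a rough illustration of the "no internal partial pages" rule, the sketch below counts how many MDs a page list would need if it were split at every internal gap. It is an assumption-laden example, not LNet code: `struct sketch_kiov`, `SKETCH_PAGE_SIZE` (assumed 4096) and `sketch_count_mds()` are hypothetical, and matchbits are not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/* Hypothetical stand-in for a kiov entry (the page pointer is omitted). */
struct sketch_kiov {
    unsigned int kiov_len;     /* bytes used in this page */
    unsigned int kiov_offset;  /* starting offset within the page */
};

/*
 * Count the contiguous runs in the list. A new run (and so a new MD)
 * starts wherever the previous entry stops short of its page end or the
 * current entry does not start at offset 0.
 */
size_t sketch_count_mds(const struct sketch_kiov *kiov, size_t niov)
{
    if (niov == 0)
        return 0;

    size_t mds = 1;
    for (size_t i = 1; i < niov; i++) {
        unsigned int prev_end = kiov[i - 1].kiov_offset + kiov[i - 1].kiov_len;

        assert(prev_end <= SKETCH_PAGE_SIZE);
        if (prev_end != SKETCH_PAGE_SIZE || kiov[i].kiov_offset != 0)
            mds++;
    }
    return mds;
}
```

A list with one internal partial page splits into two runs, each of which would then need its own MD and its own matchbits within the RPC.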

So aggregation will probably require LNET/LND support for MDs with
internal partial pages.  At a guess, this will have strict limits
for some LNDs and probably can't be done without reducing the total
number of fragments in such messages.  Also, the interaction with
LNET routers needs to be considered since mismatched RDMA descriptors
can potentially double the number of actual RDMA fragments on the
wire.

          Cheers,
                   Eric

