* Blueprint: inline data support (step 2)
@ 2013-07-31  0:10 Li Wang
  2013-07-31  0:17 ` Loic Dachary
  2013-08-09 18:18 ` Sage Weil
  0 siblings, 2 replies; 5+ messages in thread
From: Li Wang @ 2013-07-31  0:10 UTC
  To: ceph-devel@vger.kernel.org; +Cc: Sage Weil

We have worked out a preliminary implementation of inline data support
and observed a clear speedup for small-file access.

Step 2 will focus on (1) simplifying the design by eliminating the
half-inlined file state; (2) handling shared write or read/write access
efficiently; (3) protocol backward compatibility.

We have submitted a blueprint for this; anyone interested is welcome to
join the discussion.

* Re: Blueprint: inline data support (step 2)
  2013-07-31  0:10 Blueprint: inline data support (step 2) Li Wang
@ 2013-07-31  0:17 ` Loic Dachary
  2013-07-31 14:34   ` Li Wang
  2013-08-09 18:18 ` Sage Weil
  1 sibling, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2013-07-31  0:17 UTC
  To: Li Wang; +Cc: ceph-devel@vger.kernel.org

Hi,

It would be nice if the blueprint included URLs to the current implementation and to the benchmark results you got.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/Inline_data_support_%28Step_2%29

Cheers


-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


* Re: Blueprint: inline data support (step 2)
  2013-07-31  0:17 ` Loic Dachary
@ 2013-07-31 14:34   ` Li Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Li Wang @ 2013-07-31 14:34 UTC
  To: Loic Dachary; +Cc: ceph-devel@vger.kernel.org

Sure. I have been busy with some other things, but I will post them
before the summit.

Cheers,
Li Wang


* Re: Blueprint: inline data support (step 2)
  2013-07-31  0:10 Blueprint: inline data support (step 2) Li Wang
  2013-07-31  0:17 ` Loic Dachary
@ 2013-08-09 18:18 ` Sage Weil
  2013-08-11  4:29   ` Li Wang
  1 sibling, 1 reply; 5+ messages in thread
From: Sage Weil @ 2013-08-09 18:18 UTC
  To: Li Wang; +Cc: ceph-devel@vger.kernel.org

Hi Li,

Thanks for discussing this at the summit!  As I mentioned, I think email 
will be the easiest way to detail my suggestion for handling the shared 
writer or read/write case.  The notes from the summit are at

  http://pad.ceph.com/p/mds-inline-data

For the single-writer case, the client can simply dirty the buffer with
the inline data and write it out with everything else.  When it flushes
the cap back to the MDS, there will be some marker (inline_version = 0?)
indicating that the data is no longer inlined.

For the multi-writer case:

We normally do reads and writes synchronously to the OSD for simplicity.  
Everything gets ordered there at the object.  I think we can do the same 
for inline data: if there are shared writers, we uninline the data and 
fall back to storing the data in the usual way.

Each writer will have a copy of the *initial* inline data, issued by the 
MDS when they got the capability allowing them to write (or read).

On the *first* read or write operation, the client will first send an 
operation to the object that looks like

  ObjectOperation m;
  m.create(true);   // exclusive create; fails if object exists
  m.write_full(initial_inline_data);
  objecter->mutate(...);

The first client whose op reaches the OSD will effectively un-inline the
data; any others will be no-ops.  This will be immediately followed by
the actual read or write operation that they are trying to do.
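
The "first op wins, the rest are no-ops" behaviour can be illustrated
with a toy in-memory model. This is only a sketch: ToyObjectStore,
create_exclusive_and_write() and first_write() are hypothetical names
made up for illustration, not the Ceph Objecter or librados API, and the
model assumes the create and write_full steps are applied atomically as
one operation.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy in-memory stand-in for an object store (hypothetical, not Ceph code).
struct ToyObjectStore {
    std::map<std::string, std::string> objects;

    // Models "exclusive create + write_full" applied as one atomic op.
    // Returns true if this call created the object, false if the object
    // already existed (in which case nothing is changed).
    bool create_exclusive_and_write(const std::string& oid,
                                    const std::string& data) {
        return objects.emplace(oid, data).second;
    }

    // Ordinary write at an offset, extending the object if needed.
    void write(const std::string& oid, std::size_t off,
               const std::string& data) {
        std::string& obj = objects[oid];
        if (obj.size() < off + data.size())
            obj.resize(off + data.size(), '\0');
        obj.replace(off, data.size(), data);
    }
};

// What each client does on its first I/O: try to un-inline using its copy
// of the initial inline data, then perform the write it actually wanted.
// Returns true if this client was the one that un-inlined the data.
bool first_write(ToyObjectStore& store, const std::string& oid,
                 const std::string& initial_inline_data,
                 std::size_t off, const std::string& data) {
    bool uninlined_by_me =
        store.create_exclusive_and_write(oid, initial_inline_data);
    store.write(oid, off, data);
    return uninlined_by_me;
}
```

If two clients both hold the initial inline data "hello" and one writes
"J" at offset 0 and the other "y" at offset 4, only the first call
creates the object; the second client's stale copy of the inline data
does not overwrite the first client's write.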

As long as the inline_data size is smaller than the file layout stripe 
unit, this will always be the first object.
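
A minimal sketch of the striping arithmetic behind that claim, assuming
a layout described by stripe_unit, stripe_count and object_size and
RAID-0 style striping across object sets. The Layout struct and
object_index_for_offset() are illustrative names, not the actual Ceph
code.

```cpp
#include <cstdint>

// Hypothetical layout description; field names are assumptions.
struct Layout {
    uint64_t stripe_unit;   // bytes per stripe unit
    uint64_t stripe_count;  // objects striped across in one object set
    uint64_t object_size;   // bytes per object (multiple of stripe_unit)
};

// Returns the index of the object that holds file offset `off`.
uint64_t object_index_for_offset(const Layout& l, uint64_t off) {
    uint64_t stripes_per_object = l.object_size / l.stripe_unit;
    uint64_t blockno = off / l.stripe_unit;          // which stripe unit
    uint64_t stripeno = blockno / l.stripe_count;    // which stripe (row)
    uint64_t stripepos = blockno % l.stripe_count;   // column in the stripe
    uint64_t objectsetno = stripeno / stripes_per_object;
    return objectsetno * l.stripe_count + stripepos;
}
```

Every offset below stripe_unit has blockno 0, and therefore maps to
object index 0 regardless of stripe_count or object_size.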

When the caps are released to the MDS, if *any* of the clients indicate 
that they uninlined the object, it is uninlined.  (Some clients may not 
have done any IO.)  If a client fails, we need to make the recovery path 
see if the object exists and, if so, drop the inline data.
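
The decision rule on cap release and on client failure could be sketched
as follows. CapRelease, InlineState and both functions are hypothetical
names for illustration; how the flag is really carried in the cap
message is not specified here.

```cpp
#include <vector>

// What a client reports when it releases its caps (hypothetical shape).
struct CapRelease {
    bool did_io;     // some clients may not have done any I/O
    bool uninlined;  // client sent the un-inline op to the object
};

enum class InlineState { Inline, Uninlined };

// Normal path: if *any* client reports un-inlining, the file is uninlined.
InlineState state_after_release(const std::vector<CapRelease>& releases) {
    for (const CapRelease& r : releases) {
        if (r.uninlined)
            return InlineState::Uninlined;
    }
    return InlineState::Inline;
}

// Recovery path after a client failure: the failed client cannot report,
// so the state is derived from whether the first object exists. (Given
// the backtrace wrinkle discussed in this message, a real check would
// need to test for data or a flag, not mere existence.)
InlineState state_after_client_failure(bool first_object_exists) {
    return first_object_exists ? InlineState::Uninlined
                               : InlineState::Inline;
}
```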

The one wrinkle I see in this is that the m.create(true) call above isn't 
quite right; the first object will often exist because of the backtrace 
information that the MDS is maintaining (for NFS and future fsck).  We 
need to replace that with some explicit flag on the object that the data 
is inlined, which means some tricky updates and an m.cmpxattr() call.  
Alternatively (and more simply), we can just check whether the object has
size 0.  There isn't a RADOS op that lets us do that right now, but it
would be pretty simple to add one: cmpsize() or similar.
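
A toy model of the size-0 guard, to show why it copes with an object
that already exists only because of backtrace metadata. Everything here
is hypothetical: cmpsize() does not exist as a RADOS op today, and
GuardedStore, ToyObject and the "backtrace" xattr name are made up for
illustration. The model assumes the guard and the write are applied
atomically and that a full write replaces the data but keeps xattrs.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy object: byte data plus extended attributes (hypothetical model).
struct ToyObject {
    std::string data;
    std::map<std::string, std::string> xattrs;
};

struct GuardedStore {
    std::map<std::string, ToyObject> objects;

    // Hypothetical guard: true if the object's data size equals `expected`.
    // A missing object is treated as having size 0.
    bool cmpsize(const std::string& oid, std::size_t expected) const {
        auto it = objects.find(oid);
        std::size_t size = (it == objects.end()) ? 0 : it->second.data.size();
        return size == expected;
    }

    // Models "cmpsize(0) + write_full(initial_inline_data)" as one op.
    // Returns true if the inline data was written, false if the guard
    // failed because the object already holds data.
    bool uninline_if_empty(const std::string& oid,
                           const std::string& initial_inline_data) {
        if (!cmpsize(oid, 0))
            return false;
        objects[oid].data = initial_inline_data;  // xattrs are left alone
        return true;
    }
};
```

With this guard, an object that exists with only a backtrace xattr and
no data still accepts the first un-inline op, while later ops from
other clients fail the guard and change nothing.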

What do you think?

sage

* Re: Blueprint: inline data support (step 2)
  2013-08-09 18:18 ` Sage Weil
@ 2013-08-11  4:29   ` Li Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Li Wang @ 2013-08-11  4:29 UTC
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hi Sage,
   I am on holiday, which actually included the CDS day :)
   I will take care of it later. Thanks for your comments.

Cheers,
Li Wang

