* rados read ordering
@ 2014-12-08 17:03 Sage Weil
  2014-12-08 17:11 ` Yehuda Sadeh
  2014-12-08 23:38 ` Josh Durgin
  0 siblings, 2 replies; 12+ messages in thread
From: Sage Weil @ 2014-12-08 17:03 UTC (permalink / raw)
  To: sjust, jdurgin, yehuda, dillaman; +Cc: zhiqiang.wang, ceph-devel

The current RADOS behavior is that reads (on any given object) are always 
processed in the order they are submitted by the client.  This causes a 
few headaches for the cache tiering that it would be nice to avoid.  It 
also occurs to me that there are likely cases where we could go a lot 
faster by not strictly ordering things.  For example, a stat can respond 
more quickly than a large read, and some reads may hit cache while others 
go to disk.  This doesn't happen currently because of the (lame) way we do 
reads synchronously, but I hope that can change too.

I propose we drop this semantic.  If a client wants reads to have a strict 
ordering, they can set the existing RWORDERED flag (which also orders them 
with respect to writes).  That's not the most general thing ever, but I'm 
not sure we care about callers who want reads ordered with respect to each 
other but not writes.
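
To make the proposed semantics concrete, here is a toy Python sketch. It is not Ceph code: `Op`, `completion_order`, the `rwordered` field, and the cost numbers are all hypothetical, and it assumes every op on the object starts at time 0 and runs in parallel, so only the reply time is modelled. An unflagged op replies as soon as it finishes; a flagged op may not reply before anything submitted ahead of it, and nothing submitted after it may reply before it.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    cost: int                # hypothetical service time, arbitrary units
    rwordered: bool = False  # stands in for the RWORDERED flag


def completion_order(ops):
    """Return op names in the order their replies would be sent."""
    replies = []
    barrier = 0  # reply time of the most recent ordered op
    latest = 0   # latest reply time among all ops seen so far
    for index, op in enumerate(ops):
        if op.rwordered:
            # An ordered op waits for everything submitted before it.
            reply_at = max(latest, op.cost)
            barrier = reply_at
        else:
            # An unordered op only has to stay behind earlier ordered ops.
            reply_at = max(barrier, op.cost)
        latest = max(latest, reply_at)
        replies.append((reply_at, index, op.name))
    # Ties are broken by submission index.
    return [name for _, _, name in sorted(replies)]


relaxed = [Op("large read", 8), Op("stat", 1), Op("cached read", 2)]
strict = [Op(op.name, op.cost, rwordered=True) for op in relaxed]

print(completion_order(relaxed))
print(completion_order(strict))
```

In the relaxed case the stat and the cached read reply ahead of the large read; with every op flagged, the replies come back in submission order, which is the behavior we have today.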

The real question is whether there are any users that want/need this 
currently.  I can't think of any offhand.  In several places we submit 
multiple *writes* and expect them to be strictly ordered (e.g., we 
set a completion on the last write only).  I don't think we do this 
anywhere for reads though...

Josh, Yehuda, Jason--can you think of any in RBD or RGW that would depend 
on this?

sage

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rados read ordering
  2014-12-08 17:03 rados read ordering Sage Weil
@ 2014-12-08 17:11 ` Yehuda Sadeh
  2014-12-10  0:37   ` Cook, Nigel
  2014-12-08 23:38 ` Josh Durgin
  1 sibling, 1 reply; 12+ messages in thread
From: Yehuda Sadeh @ 2014-12-08 17:11 UTC (permalink / raw)
  To: Sage Weil
  Cc: Samuel Just, Josh Durgin, Jason Dillaman, zhiqiang.wang,
	ceph-devel

On Mon, Dec 8, 2014 at 9:03 AM, Sage Weil <sweil@redhat.com> wrote:
> The current RADOS behavior is that reads (on any given object) are always
> processed in the order they are submitted by the client.  This causes a
> few headaches for the cache tiering that it would be nice to avoid.  It
> also occurs to me that there are likely cases where we could go a lot
> faster by not strictly ordering things.  For example, a stat can respond
> more quickly than a large read, and some reads may hit cache while others
> go to disk.  This doesn't happen currently because of the (lame) way we do
> reads synchronously, but hope that can change too.
>
> I propose we drop this semantic.  If a client wants reads to have a strict
> ordering, they can set the existing RWORDERED flag (which also orders them
> with respect to writes).  That's not the most general thing ever, but I'm
> not sure we care about callers who want reads ordered with respect to each
> other but not writes.
>
> The real question is whether there are any users that want/need this
> currently.  I can't think of any offhand.  In several places we submit
> multiple *writes* and expect them to be strictly ordered (e.g., we
> set a completion on the last write only).  I don't think we do this
> anywhere for reads though...
>
> Josh, Yehuda, Jason--can you think of any in RBD or RGW that would depend
> on this?
>

None that I can think of. For object data, we already stripe it
across multiple objects, and the underlying assumption is that we're
going to get responses out of order, so we make sure we commit
in order. Guards are used on the head object, and the read is
synchronous there anyway. I can't think of any other place where we'd
have an issue.
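
The "responses out of order, commit in order" pattern can be sketched roughly as follows. This is a minimal illustration, not RGW code: `commit_in_order` and the `(stripe_index, payload)` tuples are hypothetical names chosen for the example. Responses are buffered as they arrive, and a stripe is handed on only once every earlier stripe has been handed on.

```python
def commit_in_order(responses):
    """Yield stripe payloads in index order, whatever order they arrive in.

    `responses` is an iterable of (stripe_index, payload) pairs in arrival
    order; indices are assumed to start at 0 and to be unique.
    """
    pending = {}    # stripes that arrived early, keyed by index
    next_index = 0  # the next stripe we are allowed to commit
    for index, payload in responses:
        pending[index] = payload
        # Drain every stripe that has now become contiguous.
        while next_index in pending:
            yield pending.pop(next_index)
            next_index += 1


arrivals = [(2, b"cc"), (0, b"aa"), (1, b"bb"), (3, b"dd")]
print(b"".join(commit_in_order(arrivals)))
```

Because the reader never relies on arrival order, dropping read ordering at the OSD would not change what this kind of caller sees.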

Yehuda

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rados read ordering
  2014-12-08 17:03 rados read ordering Sage Weil
  2014-12-08 17:11 ` Yehuda Sadeh
@ 2014-12-08 23:38 ` Josh Durgin
  1 sibling, 0 replies; 12+ messages in thread
From: Josh Durgin @ 2014-12-08 23:38 UTC (permalink / raw)
  To: Sage Weil, sjust, jdurgin, yehuda, dillaman; +Cc: zhiqiang.wang, ceph-devel

On 12/08/2014 09:03 AM, Sage Weil wrote:
> The current RADOS behavior is that reads (on any given object) are always
> processed in the order they are submitted by the client.  This causes a
> few headaches for the cache tiering that it would be nice to avoid.  It
> also occurs to me that there are likely cases where we could go a lot
> faster by not strictly ordering things.  For example, a stat can respond
> more quickly than a large read, and some reads may hit cache while others
> go to disk.  This doesn't happen currently because of the (lame) way we do
> reads synchronously, but hope that can change too.
>
> I propose we drop this semantic.  If a client wants reads to have a strict
> ordering, they can set the existing RWORDERED flag (which also orders them
> with respect to writes).  That's not the most general thing ever, but I'm
> not sure we care about callers who want reads ordered with respect to each
> other but not writes.
>
> The real question is whether there are any users that want/need this
> currently.  I can't think of any offhand.  In several places we submit
> multiple *writes* and expect them to be strictly ordered (e.g., we
> set a completion on the last write only).  I don't think we do this
> anywhere for reads though...
>
> Josh, Yehuda, Jason--can you think of any in RBD or RGW that would depend
> on this?

Nope, I've thought we should fix this since I found out about it.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rados read ordering
  2014-12-08 17:11 ` Yehuda Sadeh
@ 2014-12-10  0:37   ` Cook, Nigel
  2014-12-10  0:43     ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Cook, Nigel @ 2014-12-10  0:37 UTC (permalink / raw)
  To: Yehuda Sadeh
  Cc: Josh Durgin, Samuel Just, Sage Weil, Wang, Zhiqiang,
	Jason Dillaman, ceph-devel

Folks

I'm wondering if this is related to the question I posed a few days ago..

 Can CEPH support 2 clients simultaneously accessing a single volume - for example a database cluster - and honor read and write order of blocks across the multiple clients?

Can you comment?

Regards,
Nigel Cook +1 720 319 7508

Sent from a mobile device.
Please excuse both my brevity and sp3lling

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rados read ordering
  2014-12-10  0:37   ` Cook, Nigel
@ 2014-12-10  0:43     ` Sage Weil
  2014-12-10  1:10       ` Wang, Zhiqiang
  0 siblings, 1 reply; 12+ messages in thread
From: Sage Weil @ 2014-12-10  0:43 UTC (permalink / raw)
  To: Cook, Nigel
  Cc: Yehuda Sadeh, Josh Durgin, Samuel Just, Wang, Zhiqiang,
	Jason Dillaman, ceph-devel

On Wed, 10 Dec 2014, Cook, Nigel wrote:
> Folks
> 
> I'm wondering if this is related to the question I posed a few days 
> ago..
> 
>  Can CEPH support 2 clients simultaneously accessing a single volume - 
> for example a database cluster - and honor read and write order of 
> blocks across the multiple clients?
> 
> Can you comment?

I don't think so.  This (non)change is about a single client submitting 
two reads to a single object (say, different blocks in the same disk) and 
whether the OSD is allowed to respond out of order (say, because some 
blocks are in cache and some aren't).

In the shared volume case, it is generally not important what happens with 
requests that are submitted in parallel: they can take different amounts 
of time on the wire, and which happens first (i.e., which arrives at the 
OSD first) depends on happenstance.  What does matter is that any read 
that happens after a completed write reflects that write, which is why the 
concern is around caching.

sage


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: rados read ordering
  2014-12-10  0:43     ` Sage Weil
@ 2014-12-10  1:10       ` Wang, Zhiqiang
  2014-12-10  2:26         ` Haomai Wang
  2014-12-10  4:36         ` Cook, Nigel
  0 siblings, 2 replies; 12+ messages in thread
From: Wang, Zhiqiang @ 2014-12-10  1:10 UTC (permalink / raw)
  To: Sage Weil, Cook, Nigel
  Cc: Yehuda Sadeh, Josh Durgin, Samuel Just, Jason Dillaman,
	ceph-devel

For multiple clients accessing the same object in the same volume, I guess the clients need to synchronize among themselves to guarantee that what one reads is what the other wrote. That is to say, suppose there are two clients, A and B, where A is writing and B is reading. If B wants to read what A writes, it needs to know that A has completed the write before it issues the read. If A hasn't completed the write yet, the write could still fail, so you can't expect B to read what A writes. I think this makes sense from the client/application perspective.
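
A minimal sketch of that client-side synchronization, using Python threads in place of real clients. Everything here is an assumption made for illustration: the dict `store` stands in for the shared object, `write_done` stands in for whatever out-of-band message A sends to B, and the write is assumed to be acknowledged at the point where the dict is updated.

```python
import threading


def read_after_write():
    """Run client B's read only after client A reports a completed write."""
    store = {"obj": b"old data"}    # stands in for the shared object
    write_done = threading.Event()  # out-of-band signal from A to B
    result = []

    def client_a():
        store["obj"] = b"new data"  # the write; assumed acknowledged here
        write_done.set()            # only now tell B the write completed

    def client_b():
        write_done.wait()           # do not issue the read any earlier
        result.append(store["obj"])

    b = threading.Thread(target=client_b)
    a = threading.Thread(target=client_a)
    b.start()  # B starts first on purpose; it still waits for A
    a.start()
    a.join()
    b.join()
    return result


print(read_after_write())
```

Without the wait, B's read could run before or after A's write depending on scheduling, which is the "happenstance" case for requests submitted in parallel.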

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: rados read ordering
  2014-12-10  1:10       ` Wang, Zhiqiang
@ 2014-12-10  2:26         ` Haomai Wang
  2014-12-10  2:51           ` Wang, Zhiqiang
  2014-12-10  4:36         ` Cook, Nigel
  1 sibling, 1 reply; 12+ messages in thread
From: Haomai Wang @ 2014-12-10  2:26 UTC (permalink / raw)
  To: Wang, Zhiqiang
  Cc: Sage Weil, Cook, Nigel, Yehuda Sadeh, Josh Durgin, Samuel Just,
	Jason Dillaman, ceph-devel

On Wed, Dec 10, 2014 at 9:10 AM, Wang, Zhiqiang <zhiqiang.wang@intel.com> wrote:
> For multiple clients accessing the same volume, same object, I guess the clients need to do some synchronization between them to guarantee what it reads is what it wrote. That is to say, if two clients, say A and B, A is doing write, B is doing read. If B wants to read what A writes, it needs to know A has completed the write before issuing the read. If A hasn't completed the write yet, it could be possible that the write fails, so you can't expect B reads what A writes. I think this makes sense from the client/application perspective.

I think each write to an object is a barrier. When a write op is
received, the read ops before it are all allowed to complete out of
order among themselves, and the read ops after the write op can also
complete out of order among themselves.

I think that is still SERIALIZABLE level for a rados client.
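
One way to state that barrier rule precisely is the small check below. It is a sketch only, not anything in the OSD: `epochs` and `is_allowed` are hypothetical helpers, ops are modelled as `(kind, name)` pairs with kind "r" or "w", and names are assumed to be unique. A reordering is accepted only if reads move around between the same pair of writes and writes keep their relative order.

```python
def epochs(ops):
    """Split an op sequence into groups separated by writes.

    Reads between two writes form one unordered group (a set); each write
    is a group of its own, so the order of writes is preserved.
    """
    groups, reads = [], []
    for kind, name in ops:
        if kind == "w":
            if reads:
                groups.append(set(reads))
                reads = []
            groups.append({name})
        else:
            reads.append(name)
    if reads:
        groups.append(set(reads))
    return groups


def is_allowed(submitted, executed):
    """True if `executed` never moves a read across a write."""
    return epochs(submitted) == epochs(executed)


submitted = [("r", "r1"), ("r", "r2"), ("w", "w1"), ("r", "r3"), ("r", "r4")]
swapped_within = [("r", "r2"), ("r", "r1"), ("w", "w1"), ("r", "r4"), ("r", "r3")]
moved_across = [("r", "r1"), ("w", "w1"), ("r", "r2"), ("r", "r3"), ("r", "r4")]

print(is_allowed(submitted, swapped_within))
print(is_allowed(submitted, moved_across))
```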

-- 
Best Regards,

Wheat

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: rados read ordering
  2014-12-10  2:26         ` Haomai Wang
@ 2014-12-10  2:51           ` Wang, Zhiqiang
  0 siblings, 0 replies; 12+ messages in thread
From: Wang, Zhiqiang @ 2014-12-10  2:51 UTC (permalink / raw)
  To: Haomai Wang
  Cc: Sage Weil, Cook, Nigel, Yehuda Sadeh, Josh Durgin, Samuel Just,
	Jason Dillaman, ceph-devel

My understanding is that if a read op specifies the RWORDERED flag, it is processed in order. Otherwise, it may be processed out of order.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: rados read ordering
  2014-12-10  1:10       ` Wang, Zhiqiang
  2014-12-10  2:26         ` Haomai Wang
@ 2014-12-10  4:36         ` Cook, Nigel
  2014-12-10  5:08           ` Wang, Zhiqiang
  1 sibling, 1 reply; 12+ messages in thread
From: Cook, Nigel @ 2014-12-10  4:36 UTC (permalink / raw)
  To: Wang, Zhiqiang
  Cc: Josh Durgin, Yehuda Sadeh, Samuel Just, Sage Weil, Jason Dillaman,
	ceph-devel

I was thinking of the following use cases with the rbd client.

The ordering scenario is that, with clients A and B messaging each other, a read posted by client B after client A has successfully written will always return client A's content. Similarly, an ordered write by client A and then by client B, followed by a read by A or B, will return client B's content in all cases.

Regards,
Nigel Cook +1 720 319 7508

Sent from a mobile device.
Please excuse both my brevity and sp3lling

On Dec 9, 2014 5:10 PM, "Wang, Zhiqiang" <zhiqiang.wang@intel.com> wrote:
For multiple clients accessing the same volume, same object, I guess the clients need to do some synchronization between them to guarantee what it reads is what it wrote. That is to say, if two clients, say A and B, A is doing write, B is doing read. If B wants to read what A writes, it needs to know A has completed the write before issuing the read. If A hasn't completed the write yet, it could be possible that the write fails, so you can't expect B reads what A writes. I think this makes sense from the client/application perspective.

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Sage Weil
Sent: Wednesday, December 10, 2014 8:43 AM
To: Cook, Nigel
Cc: Yehuda Sadeh; Josh Durgin; Samuel Just; Wang, Zhiqiang; Jason Dillaman; ceph-devel
Subject: Re: rados read ordering

On Wed, 10 Dec 2014, Cook, Nigel wrote:
> Folks
>
> I'm wondering if this is related to the question I posed a few days
> ago..
>
>  Can CEPH support 2 clients simultaneously accessing a single volume -
> for example a database cluster - and honor read and write order of
> blocks across the multiple clients?
>
> Can you comment?

I don't think so.  This (non)change is about a single client submitting two reads to a single object (say, different blocks in the same disk) and whether the OSD is allowed to respond out of order (say, because some blocks are in cache and some aren't).

In the shared volume case, it is generally not important what happens with requests that are submitted in parallel.. they can take different amounts of time on the wire and which happens first (i.e., which arrives at the OSD first) depends on happenstance.  What does matter is that any read that happens after a complete write reflects that read, which is why the concern is around caching.

sage


>
> Regards,
> Nigel Cook +1 720 319 7508
>
> Sent from a mobile device.
> Please excuse both my brevity and sp3lling
>
> On Dec 8, 2014 10:12 AM, Yehuda Sadeh <yehuda@redhat.com> wrote:
> On Mon, Dec 8, 2014 at 9:03 AM, Sage Weil <sweil@redhat.com> wrote:
> > The current RADOS behavior is that reads (on any given object) are always
> > processed in the order they are submitted by the client.  This causes a
> > few headaches for the cache tiering that it would be nice to avoid.  It
> > also occurs to me that there are likely cases where we could go a lot
> > faster by not strictly ordering things.  For example, a stat can respond
> > more quickly than a large read, and some reads may hit cache while others
> > go to disk.  This doesn't happen currently because of the (lame) way we do
> > reads synchronously, but hope that can change too.
> >
> > I propose we drop this semantic.  If a client wants reads to have a strict
> > ordering, they can set the existing RWORDERED flag (which also orders them
> > with respect to writes).  That's not the most general thing ever, but I'm
> > not sure we care about callers who want reads ordered with respect to each
> > other but not writes.
> >
> > The real question is whether there are any users that want/need this
> > currently.  I can't think of any offhand.  In several places we submit
> > multiple *writes* and expect them to be strictly ordered (e.g., we
> > set a completion on the last write only).  I don't think we do this
> > anywhere for reads though...
> >
> > Josh, Yehuda, Jason--can you think of any in RBD or RGW that would depend
> > on this?
> >
>
> None that I can think of. For object data, we already stripe it
> across multiple objects, and the underlying assumption is that we're
> going to get responses out of order so we make sure we commit
> in-order. Guards are used on the head object, and the read is
> synchronous there anyway. I can't think of any other place where we'd
> have an issue.
>
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: rados read ordering
  2014-12-10  4:36         ` Cook, Nigel
@ 2014-12-10  5:08           ` Wang, Zhiqiang
  2014-12-10 16:07             ` Cook, Nigel
  0 siblings, 1 reply; 12+ messages in thread
From: Wang, Zhiqiang @ 2014-12-10  5:08 UTC (permalink / raw)
  To: Cook, Nigel
  Cc: Josh Durgin, Yehuda Sadeh, Samuel Just, Sage Weil, Jason Dillaman,
	ceph-devel

In the 2nd scenario, what if the write by client B fails for some reason? Do we also fail the following read by A or B, or do we return client A's content for that read? If we return client A's content, client A still needs to communicate with B to know which version of the content it read. So my point is, from the client's perspective, it shouldn't expect to read the content which hasn't been successfully written to the storage. Thus it's ok for ceph to not support ordering the reads/writes from multiple clients.

From: Cook, Nigel 
Sent: Wednesday, December 10, 2014 12:37 PM
To: Wang, Zhiqiang
Cc: Josh Durgin; Yehuda Sadeh; Samuel Just; Sage Weil; Jason Dillaman; ceph-devel
Subject: RE: rados read ordering

Thinking about the following use cases and the rbd client:
The ordering scenario is this: with clients A and B, assuming A and B are messaging each other, a read posted by client B after client A has successfully written will always return client A's content. Similarly, an ordered write by client A and then by client B, followed by a read by A or B, will return client B's content in all cases.
Regards,
Nigel Cook +1 720 319 7508
Sent from a mobile device.
Please excuse both my brevity and sp3lling

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: rados read ordering
  2014-12-10  5:08           ` Wang, Zhiqiang
@ 2014-12-10 16:07             ` Cook, Nigel
  2014-12-10 16:35               ` Sage Weil
  0 siblings, 1 reply; 12+ messages in thread
From: Cook, Nigel @ 2014-12-10 16:07 UTC (permalink / raw)
  To: Wang, Zhiqiang
  Cc: Yehuda Sadeh, Josh Durgin, Samuel Just, Sage Weil, Jason Dillaman,
	ceph-devel

Thanks for the reply on this.

In my use case it's the happy path: all writes work, and in this case I want read and write order honored. I'm not looking for my block subsystem to have eventual consistency.

For the unhappy path, if client B's write fails then I expect a subsequent read on client A or B to return the content written by client A.

I don't understand the conclusion in your final sentence

"So my point is, from the client’s perspective, it shouldn’t expect to read the content which hasn’t been successfully written to the storage."
I agree with this.

"Thus it's ok for ceph to not support ordering the reads/writes from multiple clients."

I don't understand how you reach this conclusion.

I might alter it to say: it's OK if read requests for different blocks respond out of order, but for a given block read/write requests are ordered in time.

This statement worries me a little. I'm guessing out-of-order read requests are only OK when the blocks aren't also being actively written. I think it's implied in the statement but doesn't come out explicitly.

Regards,
Nigel Cook +1 720 319 7508

Sent from a mobile device.
Please excuse both my brevity and sp3lling


^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: rados read ordering
  2014-12-10 16:07             ` Cook, Nigel
@ 2014-12-10 16:35               ` Sage Weil
  0 siblings, 0 replies; 12+ messages in thread
From: Sage Weil @ 2014-12-10 16:35 UTC (permalink / raw)
  To: Cook, Nigel
  Cc: Wang, Zhiqiang, Yehuda Sadeh, Josh Durgin, Samuel Just,
	Jason Dillaman, ceph-devel

On Wed, 10 Dec 2014, Cook, Nigel wrote:
> Thanks for the reply on this.
> 
> In my use case it's the happy path: all writes work, and in this case I want
> read and write order honored. I'm not looking for my block subsystem to
> have eventual consistency.
> 
> For the unhappy path, if client B's write fails then I expect a
> subsequent read on client A or B to return the content written by
> client A.

Yes.
 
> I don't understand the conclusion in your final sentence
> 
> "So my point is, from the client's perspective, it shouldn't expect to 
> read the content which hasn't been successfully written to the storage." 
> I agree with this.
> 
> "Thus it's ok for ceph to not support ordering the reads/writes from 
> multiple clients."
> 
> I don't understand how you reach this conclusion.
> 
> I might alter it to say: it's OK if read requests for different blocks
> respond out of order, but for a given block read/write requests are
> ordered in time.
> 
> This statement worries me a little. I'm guessing out-of-order read
> requests are only OK when the blocks aren't also being actively written.
> I think it's implied in the statement but doesn't come out explicitly.

The reordering I'm talking about is _parallel_ requests that are in flight 
at the same time.  They may be processed out of order.  Normal disks do 
this all day long (reordering things in the request queue to optimize for 
seeks and platter rotation timing).  If a read and a write on the same 
block/object are submitted in parallel (e.g. by different clients) the 
result is ambiguous; clustered file systems are simply careful to never do 
that (or only do it in cases where a race is okay).
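
To illustrate the reordering of parallel requests, here is a minimal Python sketch. It is a toy simulation, not Ceph code: the `serve` function, the request names, and the latencies are all hypothetical, and the `ordered` switch only loosely stands in for a client asking for strict ordering.

```python
import asyncio

async def serve(requests, ordered):
    """Return request names in completion order.

    requests: list of (name, seconds) tuples in submission order.
    ordered:  if True, requests are served one at a time in submission order.
    """
    completed = []
    gate = asyncio.Lock()  # asyncio.Lock wakes waiters in FIFO order

    async def handle(name, seconds):
        if ordered:
            async with gate:               # strict ordering: wait for earlier requests
                await asyncio.sleep(seconds)
                completed.append(name)
        else:
            await asyncio.sleep(seconds)   # all requests are in flight at once
            completed.append(name)

    await asyncio.gather(*(handle(n, s) for n, s in requests))
    return completed

# Hypothetical workload: a slow read, a stat, and a read that hits cache.
submitted = [("large-read-from-disk", 0.05), ("stat", 0.0), ("cached-read", 0.01)]
unordered = asyncio.run(serve(submitted, ordered=False))
in_order = asyncio.run(serve(submitted, ordered=True))
print("unordered:", unordered)
print("ordered:  ", in_order)
```

In the unordered run the cheap requests finish before the slow one; in the ordered run every request waits behind the slow read that was submitted first.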

The kind of 'read vs write' ordering you're talking about, and the part 
that matters for shared devices, is "read after write" consistency.  That 
is, any read that is initiated *after* a write is completed must reflect 
the result of that write (assuming it succeeded; if it failed then the 
contents of the disk are ambiguous). That is *always* true with rados, and 
is unaffected by the change we're talking about.  Rados will always be 
strongly consistent so that you can use it for things like databases and 
file systems that require strong consistency.
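
The read-after-write guarantee can be sketched the same way. This is again a toy model under stated assumptions, not RADOS itself: `ToyObject`, the two client coroutines, and the `write_done` event (standing in for whatever messaging the clients use between themselves) are hypothetical names.

```python
import asyncio

class ToyObject:
    """Hypothetical single-object store; a write becomes visible when it completes."""

    def __init__(self, data=b"old"):
        self.data = data

    async def write(self, data, latency=0.02):
        await asyncio.sleep(latency)  # request in flight
        self.data = data              # applied, then acknowledged to the writer

    async def read(self):
        await asyncio.sleep(0)        # yield once, like a very fast request
        return self.data

async def main():
    obj = ToyObject()
    write_done = asyncio.Event()      # client A -> client B notification

    async def client_a():
        await obj.write(b"from-A")
        write_done.set()              # tell B only after the write has completed

    async def client_b_racing():
        # Submitted in parallel with A's write: the result is ambiguous.
        return await obj.read()

    async def client_b_after():
        await write_done.wait()       # initiated *after* A's write completed
        return await obj.read()       # must reflect A's write

    _, racing, after = await asyncio.gather(
        client_a(), client_b_racing(), client_b_after()
    )
    return racing, after

racing, after = asyncio.run(main())
print("racing read:", racing)
print("read after completed write:", after)
```

Only the second read carries a guarantee: it was issued after the writer learned its write had completed, so it has to return the new content. The racing read may legitimately return either version.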

Does that help clarify?

Thanks!
sage



^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread (newest message: 2014-12-10 16:35 UTC)

Thread overview: 12+ messages
2014-12-08 17:03 rados read ordering Sage Weil
2014-12-08 17:11 ` Yehuda Sadeh
2014-12-10  0:37   ` Cook, Nigel
2014-12-10  0:43     ` Sage Weil
2014-12-10  1:10       ` Wang, Zhiqiang
2014-12-10  2:26         ` Haomai Wang
2014-12-10  2:51           ` Wang, Zhiqiang
2014-12-10  4:36         ` Cook, Nigel
2014-12-10  5:08           ` Wang, Zhiqiang
2014-12-10 16:07             ` Cook, Nigel
2014-12-10 16:35               ` Sage Weil
2014-12-08 23:38 ` Josh Durgin
