CEPH filesystem development
From: Alex Elder <alex.elder@linaro.org>
To: Sage Weil <sage@inktank.com>
Cc: "Yan, Zheng" <zheng.z.yan@intel.com>, ceph-devel@vger.kernel.org
Subject: Re: [PATCH 1/3] libceph: call r_unsafe_callback when unsafe reply is received
Date: Tue, 02 Jul 2013 13:11:58 -0500
Message-ID: <51D317EE.60105@linaro.org>
In-Reply-To: <alpine.DEB.2.00.1307021104560.3375@cobra.newdream.net>

On 07/02/2013 01:10 PM, Sage Weil wrote:
> On Tue, 2 Jul 2013, Alex Elder wrote:
>> On 06/24/2013 01:41 AM, Yan, Zheng wrote:
>>> From: "Yan, Zheng" <zheng.z.yan@intel.com>
>>
>> Sorry it took so long, I intended to take a look at this
>> for you sooner.
>>
>> I would also like to thank you for this nice clear
>> description.  It made it very easy to understand
>> why you were proposing the change, and to focus in
>> on exactly which parts of the design it's affecting.
>>
>>> We can't use !req->r_sent to check whether an OSD request is being
>>> sent for the first time, because __cancel_request() zeros req->r_sent
>>> when the OSD map changes. Rather than adding a new variable to struct
>>
>> You're right.
>>
>>> ceph_osd_request to indicate if it's sent for the first time, we
>>> can call the unsafe callback only when an unsafe OSD reply is received.
>>> If the OSD's first reply is safe, just skip calling the unsafe callback.
>>
>> This seems reasonable, but it's different from the way I
>> thought about what constituted "unsafe."  But I may be
>> wrong, and the way this is used by the file system might
>> do something that addresses my concern.
>>
>> The way I interpreted "unsafe" was simply that it was possible
>> a write *could* have been made persistent, even if the client
>> doesn't know about it.  A request could have made it to its
>> target osd, been written, and the response could be in flight
>> at the point something (maybe a router?) crashes and the response
>> gets lost.  During that time window, the stored data may not be
>> in a state that's consistent with the client's view of it.
>>
>> So I thought of "unsafe" as meaning that a write is in flight,
>> and until we get a successful response, the storage might
>> contain the old data or it might contain the new data; the
>> client has no way of knowing which.
>>
>> With that interpretation, a request becomes unsafe the
>> instant it leaves the client, and becomes safe again
>> the instant a response arrives.
>>
>> If my interpretation is correct, this change is wrong.
> 
> The interpretation is correct, but in this case it doesn't matter.  There 
> are two intervals:
> 
>  - write(2) starts
>  - request is sent
>   <interval 1>
>  - got ack reply, write(2) returns
>   <interval 2>
>  - got commit reply
> 
> The important end result is that we need to wait for requests in interval 
> 2 if we fsync().  With your 'unsafe' definition, we *also* wait for 
> syscalls that haven't returned yet, but this isn't necessary... fsync() 
> need only wait for completed but uncommitted writes, not racing ones.  We 
> could quibble about better naming, but the end result is correct.
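
A minimal userspace sketch of the two intervals above, to make the
fsync() rule concrete. The names here (sketch_req, req_state,
sketch_on_ack(), sketch_on_commit(), fsync_wait_count()) are
hypothetical and are not part of libceph; the sketch only models
which requests an fsync() would have to wait for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request states, named after the timeline above. */
enum req_state {
	REQ_SENT,	/* interval 1: write(2) has not returned yet */
	REQ_ACKED,	/* interval 2: write(2) returned, not yet committed */
	REQ_COMMITTED,	/* commit reply received */
};

struct sketch_req {
	enum req_state state;
};

/* Ack reply: the write(2) caller is released; the request enters interval 2. */
static void sketch_on_ack(struct sketch_req *req)
{
	assert(req->state == REQ_SENT);
	req->state = REQ_ACKED;
}

/* Commit reply: may be the first reply (skipping interval 2) or follow an ack. */
static void sketch_on_commit(struct sketch_req *req)
{
	req->state = REQ_COMMITTED;
}

/* fsync() waits only for completed but uncommitted writes (interval 2). */
static bool fsync_must_wait_for(const struct sketch_req *req)
{
	return req->state == REQ_ACKED;
}

/* Count the requests an fsync() issued right now would have to wait on. */
static size_t fsync_wait_count(const struct sketch_req *reqs, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (fsync_must_wait_for(&reqs[i]))
			count++;
	return count;
}
```

Under this model a request that is still in interval 1 is never
counted, which matches the point that fsync() need not wait for
racing writes.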

OK, sounds good to me.  In that case you can include this if you like:

Reviewed-by: Alex Elder <elder@linaro.org>

> sage
> 
> 
>>
>> But I may be wrong, and there may really be no need to
>> worry about a possible modification of data until after
>> an acknowledgement response is received.  In that case,
>> I've looked at your patch and it looks good.
>>
>> Can you explain why I'm wrong about what is "unsafe?"
>>
>> 					-Alex
>>
>>> The purpose of the unsafe callback is to add the unsafe request to a list,
>>> so that fsync(2) can wait for the safe reply. fsync(2) doesn't need
>>> to wait for a write(2) that hasn't returned yet. So it's OK to add the
>>> request to the unsafe list when the first OSD reply is received.
>>> (ceph_sync_write() returns after receiving the first OSD reply.)
>>>
>>> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>>> ---
>>>  net/ceph/osd_client.c | 14 +++++++-------
>>>  1 file changed, 7 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
>>> index 540dd29..dd47889 100644
>>> --- a/net/ceph/osd_client.c
>>> +++ b/net/ceph/osd_client.c
>>> @@ -1337,10 +1337,6 @@ static void __send_request(struct ceph_osd_client *osdc,
>>>  
>>>  	ceph_msg_get(req->r_request); /* send consumes a ref */
>>>  
>>> -	/* Mark the request unsafe if this is the first timet's being sent. */
>>> -
>>> -	if (!req->r_sent && req->r_unsafe_callback)
>>> -		req->r_unsafe_callback(req, true);
>>>  	req->r_sent = req->r_osd->o_incarnation;
>>>  
>>>  	ceph_con_send(&req->r_osd->o_con, req->r_request);
>>> @@ -1431,8 +1427,6 @@ static void handle_osds_timeout(struct work_struct *work)
>>>  
>>>  static void complete_request(struct ceph_osd_request *req)
>>>  {
>>> -	if (req->r_unsafe_callback)
>>> -		req->r_unsafe_callback(req, false);
>>>  	complete_all(&req->r_safe_completion);  /* fsync waiter */
>>>  }
>>>  
>>> @@ -1559,14 +1553,20 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
>>>  	mutex_unlock(&osdc->request_mutex);
>>>  
>>>  	if (!already_completed) {
>>> +		if (req->r_unsafe_callback &&
>>> +		    result >= 0 && !(flags & CEPH_OSD_FLAG_ONDISK))
>>> +			req->r_unsafe_callback(req, true);
>>>  		if (req->r_callback)
>>>  			req->r_callback(req, msg);
>>>  		else
>>>  			complete_all(&req->r_completion);
>>>  	}
>>>  
>>> -	if (flags & CEPH_OSD_FLAG_ONDISK)
>>> +	if (flags & CEPH_OSD_FLAG_ONDISK) {
>>> +		if (req->r_unsafe_callback && already_completed)
>>> +			req->r_unsafe_callback(req, false);
>>>  		complete_request(req);
>>> +	}
>>>  
>>>  done:
>>>  	dout("req=%p req->r_linger=%d\n", req, req->r_linger);
>>>
>>
>>
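
For reference, the callback placement in the handle_reply() hunk
quoted above can be condensed into one decision function. This is a
userspace sketch, not the kernel code: unsafe_action_for_reply(),
enum unsafe_action and SKETCH_FLAG_ONDISK are hypothetical names, the
flag value is illustrative rather than the real CEPH_OSD_FLAG_ONDISK
value, and the sketch assumes r_unsafe_callback is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for CEPH_OSD_FLAG_ONDISK; the value is illustrative only. */
#define SKETCH_FLAG_ONDISK 0x1u

enum unsafe_action {
	UNSAFE_NONE,	/* no unsafe callback for this reply */
	UNSAFE_BEGIN,	/* corresponds to r_unsafe_callback(req, true) */
	UNSAFE_END,	/* corresponds to r_unsafe_callback(req, false) */
};

/*
 * Mirror the two conditions in the quoted hunk. They are mutually
 * exclusive because one requires !already_completed and the other
 * requires already_completed.
 */
static enum unsafe_action unsafe_action_for_reply(bool already_completed,
						  int result,
						  unsigned int flags)
{
	bool ondisk = (flags & SKETCH_FLAG_ONDISK) != 0;

	if (!already_completed && result >= 0 && !ondisk)
		return UNSAFE_BEGIN;
	if (ondisk && already_completed)
		return UNSAFE_END;
	return UNSAFE_NONE;
}

/* Walk through the cases described in the commit message. */
static void unsafe_action_selfcheck(void)
{
	/* First reply is an ack: the request becomes unsafe. */
	assert(unsafe_action_for_reply(false, 0, 0) == UNSAFE_BEGIN);
	/* Commit reply after an earlier ack: the request becomes safe. */
	assert(unsafe_action_for_reply(true, 0, SKETCH_FLAG_ONDISK) == UNSAFE_END);
	/* First reply is already a commit: the unsafe callback is skipped. */
	assert(unsafe_action_for_reply(false, 0, SKETCH_FLAG_ONDISK) == UNSAFE_NONE);
}
```

Calling unsafe_action_selfcheck() from a small test program exercises
the three cases from the commit message.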

