From mboxrd@z Thu Jan 1 00:00:00 1970 From: Josh Durgin Subject: Re: [PATCH] libceph: apply new_state before new_up_client on incrementals Date: Thu, 21 Jul 2016 12:22:55 -0700 Message-ID: <5791210F.8040408@redhat.com> References: <1468963597-10667-1-git-send-email-idryomov@gmail.com> <57901170.4000809@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mx1.redhat.com ([209.132.183.28]:41487 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753848AbcGUTW4 (ORCPT ); Thu, 21 Jul 2016 15:22:56 -0400 In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Ilya Dryomov Cc: Ceph Development On 07/21/2016 08:27 AM, Ilya Dryomov wrote: > On Thu, Jul 21, 2016 at 2:04 AM, Josh Durgin wrote: >> On 07/19/2016 02:26 PM, Ilya Dryomov wrote: >>> >>> Currently, osd_weight and osd_state fields are updated in the encoding >>> order. This is wrong, because an incremental map may look like e.g. >>> >>> new_up_client: { osd=6, addr=... } # set osd_state and addr >>> new_state: { osd=6, xorstate=EXISTS } # clear osd_state >>> >>> Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After >>> applying new_up_client, osd_state is changed to EXISTS | UP. Carrying >>> on with the new_state update, we flip EXISTS and leave osd6 in a weird >>> "!EXISTS but UP" state. A non-existent OSD is considered down by the >>> mapping code >>> >>> 2087 for (i = 0; i < pg->pg_temp.len; i++) { >>> 2088 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) { >>> 2089 if (ceph_can_shift_osds(pi)) >>> 2090 continue; >>> 2091 >>> 2092 temp->osds[temp->size++] = CRUSH_ITEM_NONE; >>> >>> and so requests get directed to the second OSD in the set instead of >>> the first, resulting in OSD-side errors like: >>> >>> [WRN] : client.4239 192.168.122.21:0/2444980242 misdirected >>> client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680 >>> >>> and hung rbds on the client: >>> >>> [ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0) >>> [ 493.566805] rbd: rbd0: result -6 xferred 400000 >>> [ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688 >>> >>> The fix is to decouple application from the decoding and: >>> - apply new_weight first >>> - apply new_state before new_up_client >>> - twiddle osd_state flags if marking in >>> - clear out some of the state if osd is destroyed >>> >>> Fixes: http://tracker.ceph.com/issues/14901 >>> >>> Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc1d: libceph: set 'exists' >>> flag for newly up osd >>> Cc: stable@vger.kernel.org # 3.15+ >>> Signed-off-by: Ilya Dryomov >>> --- >>> net/ceph/osdmap.c | 156 >>> +++++++++++++++++++++++++++++++++++++++--------------- >>> 1 file changed, 113 insertions(+), 43 deletions(-) >> >> >> This matches userspace's incremental handling now. Looks good. > > Technically, no. new_primary_affinity is decoded along with new_weight > in userspace and crushmap is applied at the very end, after weights and > states are applied. AFAICT crushmap is applied last only because of > the userspace-only adjust_osd_weights(), which needs to have the new > weights. > > As for primary-affinity, the following > > new_up_client: { osd=X, addr=... } > new_state: { osd=X, xorstate=EXISTS } > new_primary_affinity: { osd=X, aff=Y } > > would be misinterpreted - the kernel client would stay on aff=Y instead > of aff=1 (reset), but I don't see a (obvious, at least) way to get such > a map, and, as I wanted this to be as small as possible for backporting, > I punted on that. Let me know if you see otherwise. Yeah, I also see no way for this to happen, and it isn't likely to be needed in the future either. So keeping this update small sounds good to me. Josh