Re: [PATCH] libceph: apply new_state before new_up_client on incrementals

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Josh Durgin <jdurgin@redhat.com>
To: Ilya Dryomov <idryomov@gmail.com>
Cc: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: [PATCH] libceph: apply new_state before new_up_client on incrementals
Date: Thu, 21 Jul 2016 12:22:55 -0700	[thread overview]
Message-ID: <5791210F.8040408@redhat.com> (raw)
In-Reply-To: <CAOi1vP-BOfzQpNH1yRaZO78F0aT3ictmWJ_UO-9K1bsQDriM_w@mail.gmail.com>

On 07/21/2016 08:27 AM, Ilya Dryomov wrote:
> On Thu, Jul 21, 2016 at 2:04 AM, Josh Durgin <jdurgin@redhat.com> wrote:
>> On 07/19/2016 02:26 PM, Ilya Dryomov wrote:
>>>
>>> Currently, osd_weight and osd_state fields are updated in the encoding
>>> order.  This is wrong, because an incremental map may look like e.g.
>>>
>>>       new_up_client: { osd=6, addr=... } # set osd_state and addr
>>>       new_state: { osd=6, xorstate=EXISTS } # clear osd_state
>>>
>>> Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down).  After
>>> applying new_up_client, osd_state is changed to EXISTS | UP.  Carrying
>>> on with the new_state update, we flip EXISTS and leave osd6 in a weird
>>> "!EXISTS but UP" state.  A non-existent OSD is considered down by the
>>> mapping code
>>>
>>> 2087    for (i = 0; i < pg->pg_temp.len; i++) {
>>> 2088            if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
>>> 2089                    if (ceph_can_shift_osds(pi))
>>> 2090                            continue;
>>> 2091
>>> 2092                    temp->osds[temp->size++] = CRUSH_ITEM_NONE;
>>>
>>> and so requests get directed to the second OSD in the set instead of
>>> the first, resulting in OSD-side errors like:
>>>
>>> [WRN] : client.4239 192.168.122.21:0/2444980242 misdirected
>>> client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
>>>
>>> and hung rbds on the client:
>>>
>>> [  493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
>>> [  493.566805] rbd: rbd0:   result -6 xferred 400000
>>> [  493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
>>>
>>> The fix is to decouple application from the decoding and:
>>> - apply new_weight first
>>> - apply new_state before new_up_client
>>> - twiddle osd_state flags if marking in
>>> - clear out some of the state if osd is destroyed
>>>
>>> Fixes: http://tracker.ceph.com/issues/14901
>>>
>>> Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc1d: libceph: set 'exists'
>>> flag for newly up osd
>>> Cc: stable@vger.kernel.org # 3.15+
>>> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
>>> ---
>>>    net/ceph/osdmap.c | 156
>>> +++++++++++++++++++++++++++++++++++++++---------------
>>>    1 file changed, 113 insertions(+), 43 deletions(-)
>>
>>
>> This matches userspace's incremental handling now. Looks good.
>
> Technically, no.  new_primary_affinity is decoded along with new_weight
> in userspace and crushmap is applied at the very end, after weights and
> states are applied.  AFAICT crushmap is applied last only because of
> the userspace-only adjust_osd_weights(), which needs to have the new
> weights.
>
> As for primary-affinity, the following
>
>      new_up_client: { osd=X, addr=... }
>      new_state: { osd=X, xorstate=EXISTS }
>      new_primary_affinity: { osd=X, aff=Y }
>
> would be misinterpreted - the kernel client would stay on aff=Y instead
> of aff=1 (reset), but I don't see a (obvious, at least) way to get such
> a map, and, as I wanted this to be as small as possible for backporting,
> I punted on that.  Let me know if you see otherwise.

Yeah, I also see no way for this to happen, and it isn't likely to be
needed in the future either. So keeping this update small sounds good
to me.

Josh

     prev parent reply	other threads:[~2016-07-21 19:22 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-07-19 21:26 [PATCH] libceph: apply new_state before new_up_client on incrementals Ilya Dryomov
2016-07-21  0:04 ` Josh Durgin
2016-07-21 15:27   ` Ilya Dryomov
2016-07-21 19:22     ` Josh Durgin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5791210F.8040408@redhat.com \
    --to=jdurgin@redhat.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=idryomov@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.