Backporting stability fixes for ceph-disk

* Backporting stability fixes for ceph-disk
@ 2016-02-02  5:53 Loic Dachary
  2016-02-03 17:56 ` Ken Dreyer
  0 siblings, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2016-02-02  5:53 UTC (permalink / raw)
  To: Ken Dreyer; +Cc: Ceph Development

Hi Ken,

https://github.com/ceph/ceph/pull/6926 and https://github.com/ceph/ceph/pull/5999 fixed a number of stability problems related to the udev / ceph-disk / initsystem code path. I'm now convinced (after a few weeks with no surprising failures when running the ceph-disk teuthology suite) that we have something stable. I'm not saying all problems have been found and fixed. But at least we have something stable and repeatable to work with. I've used it as a base for a partial refactor of ceph-disk to support Bluestore ( https://github.com/ceph/ceph/pull/7218 ).

I think all stability fixes have been backported to infernalis ( https://github.com/ceph/ceph/pull/7001 etc. ). Unfortunately backporting to hammer can't be done by cherry-picking commits from https://github.com/ceph/ceph/pull/6926 and https://github.com/ceph/ceph/pull/5999. In hammer things go like this:

    * ceph-disk prepare
    * triggers a udev event
    * udev action runs ceph-disk activate
    * ceph-disk activate runs ceph-osd via the init system

In infernalis Sage implemented an intermediate step so that the udev action does as little as possible:

    * ceph-disk prepare
    * triggers a udev event
    * udev action asynchronously delegates activation to the init system
    * the init system runs ceph-disk activate
    * ceph-disk activate runs ceph-osd via the init system

This helps with stability for two reasons: ceph-disk activate may trigger udev events, which is not recommended when running as a child process of a udev action, and ceph-disk activate may take minutes to complete in some cases. Backporting this logic to hammer would require shipping new init files (a ceph-disk unit for systemd, for instance) and new udev rules (calling ceph-disk trigger instead of ceph-disk activate, to add the delegation step).
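
To make the delegation step concrete, here is a minimal sketch of what the udev side could hand off to systemd. It is an illustration only, not the actual ceph-disk trigger code: the function names, the unit name ceph-disk@<device>.service and the simplified escaping of the device path are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the delegation step, not the real "ceph-disk trigger".

# Build a unit name from a device path, e.g. /dev/sdb1 -> ceph-disk@dev-sdb1.service.
# Both the template name and this simplified escaping are assumptions.
unit_for_device() {
    printf 'ceph-disk@%s.service\n' "$(printf '%s' "${1#/}" | tr '/' '-')"
}

# Queue the activation job and return immediately: --no-block tells systemctl
# not to wait for the job, so nothing slow runs as a child of the udev action.
delegate_activation() {
    systemctl --no-block start "$(unit_for_device "$1")"
}
```

A udev rule would then only need to run something like delegate_activation /dev/sdb1, and the slow work (and any udev events it generates) would happen under the init system instead.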

The conservative approach to the problem would be to cherry-pick what we can ( https://github.com/dachary/ceph/commit/9dce05a8cdfc564c5162885bbb67a04ad7b95c5a for instance ) and document known side effects of ceph-disk instability so people know it's an annoyance but nothing destructive or blocking. In the worst case scenario, deactivating the udev rules and running ceph-disk prepare + ceph-disk activate manually or by writing a script that does things sequentially is a viable workaround.

The better approach would be to backport the udev / init system changes together with most of what ceph-disk is in infernalis. Not only would that solve the problems we know about, it would also give us solid ground on which to fix future problems. Unfortunately it is, IMHO, too much of a risk at this stage of the hammer release.

I'm quite conflicted about how to approach this in a sane way, and your input would be most valuable.

Cheers

* Re: Backporting stability fixes for ceph-disk
  2016-02-02  5:53 Backporting stability fixes for ceph-disk Loic Dachary
@ 2016-02-03 17:56 ` Ken Dreyer
  2016-02-03 19:10   ` Loic Dachary
  0 siblings, 1 reply; 5+ messages in thread
From: Ken Dreyer @ 2016-02-03 17:56 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

Hi Loic,

Thanks for explaining the differences between Hammer's disk
activations and Jewel's. I think I understand the problem better now.

On Mon, Feb 1, 2016 at 10:53 PM, Loic Dachary <ldachary@redhat.com> wrote:
> The conservative approach to the problem would be to cherry-pick what
> we can (
> https://github.com/dachary/ceph/commit/9dce05a8cdfc564c5162885bbb67a04ad7b95c5a
> for instance ) and document known side effects of ceph-disk
> instability so people know it's an annoyance but nothing destructive
> or blocking. In the worst case scenario, deactivating the udev rules
> and running ceph-disk prepare + ceph-disk activate manually or by
> writing a script that does things sequentially is a viable workaround.

This approach (documentation) sounds reasonable to me, and it makes
sense that the larger re-architecture of running "ceph-disk activate"
outside udev is only something that can happen in a major release
boundary (in this case Infernalis / Jewel). Once we're happy that the
docs for manually recovering are solid, we can possibly address it
with a script as you suggest.

If we can document the worst case scenario and what to do when
ceph-disk-in-udev fails, that would really improve the user
experience.

What's the procedure for deactivating the Hammer udev rules, for example?

- Ken

* Re: Backporting stability fixes for ceph-disk
  2016-02-03 17:56 ` Ken Dreyer
@ 2016-02-03 19:10   ` Loic Dachary
  2016-02-04  3:13     ` Ken Dreyer
  0 siblings, 1 reply; 5+ messages in thread
From: Loic Dachary @ 2016-02-03 19:10 UTC (permalink / raw)
  To: Ken Dreyer; +Cc: Ceph Development



On 04/02/2016 00:56, Ken Dreyer wrote:
> Hi Loic,
> 
> Thanks for explaining the differences between Hammer's disk
> activations and Jewel's. I think I understand the problem better now.
> 
> On Mon, Feb 1, 2016 at 10:53 PM, Loic Dachary <ldachary@redhat.com> wrote:
>> The conservative approach to the problem would be to cherry-pick what
>> we can (
>> https://github.com/dachary/ceph/commit/9dce05a8cdfc564c5162885bbb67a04ad7b95c5a
>> for instance ) and document known side effects of ceph-disk
>> instability so people know it's an annoyance but nothing destructive
>> or blocking. In the worst case scenario, deactivating the udev rules
>> and running ceph-disk prepare + ceph-disk activate manually or by
>> writing a script that does things sequentially is a viable workaround.
> 
> This approach (documentation) sounds reasonable to me, and it makes
> sense that the larger re-architecture of running "ceph-disk activate"
> outside udev is only something that can happen in a major release
> boundary (in this case Infernalis / Jewel). Once we're happy that the
> docs for manually recovering are solid, we can possibly address it
> with a script as you suggest.

The script really just adds a call to ceph-disk activate-all somewhere at boot time (/etc/rc.local, maybe?).
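
As a rough sketch of what such a boot-time hook could look like (the function name and the CEPH_DISK override variable are made up for illustration, and placing the call in /etc/rc.local is only the guess made above):

```shell
#!/bin/sh
# Hypothetical boot-time hook, e.g. sourced and called from /etc/rc.local.
# CEPH_DISK can be overridden to point at another binary (or a stub).
activate_all_at_boot() {
    # Wait for pending udev events before looking at the partitions.
    udevadm settle --timeout=600
    # Activate every prepared OSD partition found on this host.
    ${CEPH_DISK:-ceph-disk} activate-all
}
```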

> If we can document the worst case scenario and what to do when
> ceph-disk-in-udev fails, that would really improve the user
> experience.
> 
> What's the procedure for deactivating the Hammer udev rules, for example?

rm /lib/udev/rules.d/*ceph*
udevadm control --reload # maybe superfluous


-- 
Loïc Dachary, Artisan Logiciel Libre
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Backporting stability fixes for ceph-disk
  2016-02-03 19:10   ` Loic Dachary
@ 2016-02-04  3:13     ` Ken Dreyer
  2016-02-04  5:18       ` Loic Dachary
  0 siblings, 1 reply; 5+ messages in thread
From: Ken Dreyer @ 2016-02-04  3:13 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Wed, Feb 3, 2016 at 12:10 PM, Loic Dachary <loic@dachary.org> wrote:
> On 04/02/2016 00:56, Ken Dreyer wrote:
>> What's the procedure for deactivating the Hammer udev rules, for example?
>
> rm /lib/udev/rules.d/*ceph*
> udevadm control --reload # maybe superfluous
>

I am surprised to see that we'd want to delete files from /lib. How
would the user restore them afterwards? Sorry if this sounds dense;
I'm definitely a udev noob. Could you provide a "starting from
scratch" procedure for how to handle ceph-disk failures in Hammer?

- Ken

* Re: Backporting stability fixes for ceph-disk
  2016-02-04  3:13     ` Ken Dreyer
@ 2016-02-04  5:18       ` Loic Dachary
  0 siblings, 0 replies; 5+ messages in thread
From: Loic Dachary @ 2016-02-04  5:18 UTC (permalink / raw)
  To: Ken Dreyer; +Cc: Ceph Development

On 04/02/2016 10:13, Ken Dreyer wrote:
> On Wed, Feb 3, 2016 at 12:10 PM, Loic Dachary <loic@dachary.org> wrote:
>> On 04/02/2016 00:56, Ken Dreyer wrote:
>>> What's the procedure for deactivating the Hammer udev rules, for example?
>>
>> rm /lib/udev/rules.d/*ceph*
>> udevadm control --reload # maybe superfluous
>>
> 
> I am surprised to see that we'd want to delete files from /lib. How
> would the user restore them afterwards? 

Re-installing the ceph package that contains them will restore them.
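
A reversible alternative, offered only as an untested sketch: udev gives a file in /etc/udev/rules.d precedence over a file of the same name in /lib/udev/rules.d, and a symlink to /dev/null disables that rules file entirely. The function names and the RULES_DIR / MASK_DIR variables below are made up for illustration; the defaults are the usual udev locations.

```shell
#!/bin/sh
# Hypothetical helpers: mask the ceph udev rules instead of deleting them.
# RULES_DIR and MASK_DIR can be overridden to try this on scratch directories.

mask_ceph_rules() {
    rules_dir="${RULES_DIR:-/lib/udev/rules.d}"
    mask_dir="${MASK_DIR:-/etc/udev/rules.d}"
    for rule in "$rules_dir"/*ceph*; do
        [ -e "$rule" ] || continue          # the glob matched nothing
        # A same-named symlink to /dev/null overrides and disables the rule.
        ln -sf /dev/null "$mask_dir/$(basename "$rule")"
    done
}

unmask_ceph_rules() {
    mask_dir="${MASK_DIR:-/etc/udev/rules.d}"
    for link in "$mask_dir"/*ceph*; do
        # Only remove the /dev/null symlinks, never a real local rules file.
        if [ -L "$link" ] && [ "$(readlink "$link")" = /dev/null ]; then
            rm -f "$link"
        fi
    done
}
```

Either function would be followed by udevadm control --reload, as with the rm above. The packaged files in /lib stay untouched, so nothing needs to be re-installed to undo the change.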

> Sorry if this sounds dense;
> I'm definitely a udev noob. Could you provide a "starting from
> scratch" procedure for how to handle ceph-disk failures in Hammer?

My own bias is to understand why things go wrong before fixing them, which can be complicated when udev / initsystem / ceph-disk are involved. To this day I would still not be able to write a guide explaining how to do that reliably. Only recently did I discover that messages that should be in syslog can be discarded entirely on RHEL unless the abrt package is installed. Even then, you have to know that the output must be collected from a file that is referenced in the syslog messages, because it is not included in the messages themselves.

If there is a suspicion that udev / initsystem / ceph-disk is not doing the right thing with hammer and understanding why is secondary, I would recommend removing the udev rules and doing things manually as suggested in the previous mail. Whenever there is a problem, it's usually not because individual components are at fault, it's because they race with each other in ways that were not fully understood back in hammer.

The most frequent mistake is thinking that more partprobe / partx is better and fixes things. It's actually the opposite: when the udev rules are in play, running more partprobe / partx will create new udev events that will race with those already in flight (see http://tracker.ceph.com/issues/14099 for instance). It can do even worse: partprobe /dev/sdb will remove existing partitions before adding them again, to be extra sure the kernel has an accurate view of the partition table. I'll let you imagine what that can do on a live system. partx does not have that problem, but only because it assumes the caller knows exactly what information the kernel has about the partition table. That leads to confusing situations. For instance: a partition is added and partx is called to notify the kernel, which fires a udev event; the partition is then deleted, but the caller fails to notify the kernel. If the same partition is added again, partx notifies the kernel, which does nothing instead of firing a udev event because, from its point of view, the partition still exists.

In hammer, ceph-disk did not consistently guard partprobe against such races (running udevadm settle ; partprobe ; udevadm settle is enough, but that was not done consistently) and it had to call partprobe / partx more than once, for instance right after a journal partition was created and before creating the data partition. Calls to partprobe and udevadm settle also need to be more patient than the default, especially when dmcrypt is in play. In practice this means ceph-disk must call udevadm settle --timeout=600 and retry partprobe a few times before declaring failure (there is no user control over the partprobe timeout). The ceph-disk suite routinely shows partprobe trying two or three times at 60-second intervals before succeeding (this is extreme because it happens in a cloud environment where performance varies a lot).
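
The settle / partprobe / settle guard with retries can be sketched as a small shell function. This is an illustration, not the code in ceph-disk: the function name and the default of five attempts are assumptions, and the 60-second delay only mirrors the interval seen in the ceph-disk suite.

```shell
#!/bin/sh
# Hypothetical helper: guard partprobe with udevadm settle and retry on failure.
# Usage: guarded_partprobe DEVICE [ATTEMPTS] [DELAY_SECONDS]
guarded_partprobe() {
    dev="$1"
    attempts="${2:-5}"      # assumed default, not taken from ceph-disk
    delay="${3:-60}"
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Let udev events already in flight finish before touching the table.
        udevadm settle --timeout=600
        if partprobe "$dev"; then
            # Wait for the events that partprobe itself just generated.
            udevadm settle --timeout=600
            return 0
        fi
        i=$((i + 1))
        # Pause before the next attempt, but not after the last one.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    echo "partprobe $dev failed after $attempts attempts" >&2
    return 1
}
```

For example, guarded_partprobe /dev/sdb would try up to five times, one minute apart, before reporting failure.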

All these troubles go away if the udev rules are deactivated, because partprobe will then no longer run ceph-disk indirectly. The timeout issue may still be a concern, but I think that in real-life situations, if ceph-disk prepare is done first and a separate script does the ceph-disk activate-all, the odds that ceph-disk activate fails because a partprobe run by ceph-disk prepare did not complete are very low. An automated script could do:

ceph-disk prepare /dev/sdb
ceph-disk prepare /dev/sdc
ceph-disk prepare /dev/sdd
...
udevadm settle --timeout=600
ceph-disk activate /dev/sdb1
ceph-disk activate /dev/sdc1
ceph-disk activate /dev/sdd1
...
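
The same sequence, written as a loop over an arbitrary list of devices, could look like the sketch below. The function name and the CEPH_DISK override variable are made up for illustration, and it assumes, as in the commands above, that the data partition is partition 1 of each prepared device.

```shell
#!/bin/sh
# Hypothetical wrapper: prepare every device, settle once, then activate
# sequentially. Assumes the data partition is <device>1, as in /dev/sdb1.
# CEPH_DISK can be overridden to point at another binary (or a stub).
prepare_then_activate() {
    for dev in "$@"; do
        ${CEPH_DISK:-ceph-disk} prepare "$dev" || return 1
    done
    # Give the kernel and udev time to catch up with the new partitions.
    udevadm settle --timeout=600
    for dev in "$@"; do
        ${CEPH_DISK:-ceph-disk} activate "${dev}1" || return 1
    done
}
```

It would be called as prepare_then_activate /dev/sdb /dev/sdc /dev/sdd and stops at the first command that fails.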

I hope that clarifies the situation?

-- 
Loïc Dachary, Artisan Logiciel Libre