* Suspend and the ceph clients
@ 2014-05-14 22:07 Gregory Farnum
  2014-05-15  8:13 ` Holger Hoffstätte
  2014-05-15 14:29 ` Sage Weil
  0 siblings, 2 replies; 5+ messages in thread
From: Gregory Farnum @ 2014-05-14 22:07 UTC
  To: ceph-devel@vger.kernel.org; +Cc: Sage Weil, Zheng Yan

There's a recent ticket discussing the behavior of ceph-fuse after the
machine it's running on has been suspended:
http://tracker.ceph.com/issues/8291

In short, CephFS clients which are disconnected from the cluster for a
sufficiently long time are generally forbidden from reconnecting —
after a configurable timeout, their "capabilities" on inodes and
dentries are revoked, and other users are allowed to change them. If
the client then comes back, it's entirely possible it has incompatible
changes to the tree, so we don't let it reconnect in order to prevent
that. (We could make the system smarter in some situations, if for
instance nobody has changed the given filesystem data in the
meanwhile, but that's hard and a problem for another day.)

Apparently, ceph-fuse does exactly this, as we expect (although we
have newly-merged features which let the admin force a reconnect). But
the kernel client does allow a reconnect. I haven't done this myself,
so the first question is just a fact check for Sage or Zheng:
1) What is the kernel client doing after suspend? Does it in fact
reconnect under situations where ceph-fuse won't, and what are they?

More interestingly, while suspended systems aren't part of our normal
target use case, they'd be nice to support well. The trivial solution
would be to somehow flush out all dirty data on suspend, and then on
wake or when we discover we have a reset session, we can clean out our
cache and reconnect as a new client if we have no dirty data.
Unfortunately, I don't know anything about Linux's suspend
functionality or APIs, and my weak attempts at googling and grepping
aren't turning anything up. So a question to everybody:

2) What notifications does Linux send, and what filesystem mechanisms
does it invoke, when it is suspending?
I see that it has in the past forced a sync whenever suspending, but I
think that's no longer required. Are there other interfaces we can
rely on, or use heuristically?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Suspend and the ceph clients
  2014-05-14 22:07 Suspend and the ceph clients Gregory Farnum
@ 2014-05-15  8:13 ` Holger Hoffstätte
  2014-05-15  8:28   ` Holger Hoffstätte
  2014-05-15 14:29 ` Sage Weil
  1 sibling, 1 reply; 5+ messages in thread
From: Holger Hoffstätte @ 2014-05-15  8:13 UTC
  To: ceph-devel

On Wed, 14 May 2014 15:07:44 -0700, Gregory Farnum wrote:
> [..]
> Unfortunately, I don't know anything about Linux's suspend
> functionality or APIs, and my weak attempts at googling and grepping
> aren't turning anything up. So a question to everybody:
> 
> 2) What notifications does Linux send, and what filesystem mechanisms
> does it invoke, when it is suspending?

Is this what you're looking for?
https://www.kernel.org/doc/Documentation/power/freezing-of-tasks.txt

User space is usually controlled by ("legacy") pm-utils, which simply
executes a bunch of scripts (packaged and user provided) in various
stages. It works reasonably well but is of course fragile - typical
scripted duct tape.

systemd has its own (IMHO much less fragile) way of doing things:
http://www.freedesktop.org/software/systemd/man/systemd-sleep.conf.html

Hope this helps.

-h


* Re: Suspend and the ceph clients
  2014-05-15  8:13 ` Holger Hoffstätte
@ 2014-05-15  8:28   ` Holger Hoffstätte
  0 siblings, 0 replies; 5+ messages in thread
From: Holger Hoffstätte @ 2014-05-15  8:28 UTC
  To: ceph-devel


Forgot something..

> User space is usually controlled by ("legacy") pm-utils, which simply

The power management events themselves are typically handled via acpid,
which then executes scripts associated with them (i.e. things provided
by pm-utils). Division of responsibilities.
Not sure how PM would work without ACPI.

-h


* Re: Suspend and the ceph clients
  2014-05-14 22:07 Suspend and the ceph clients Gregory Farnum
  2014-05-15  8:13 ` Holger Hoffstätte
@ 2014-05-15 14:29 ` Sage Weil
  2014-05-15 22:19   ` Gregory Farnum
  1 sibling, 1 reply; 5+ messages in thread
From: Sage Weil @ 2014-05-15 14:29 UTC
  To: Gregory Farnum; +Cc: ceph-devel@vger.kernel.org, Zheng Yan

On Wed, 14 May 2014, Gregory Farnum wrote:
> There's a recent ticket discussing the behavior of ceph-fuse after the
> machine it's running on has been suspended:
> http://tracker.ceph.com/issues/8291
> 
> In short, CephFS clients which are disconnected from the cluster for a
> sufficiently long time are generally forbidden from reconnecting --
> after a configurable timeout, their "capabilities" on inodes and
> dentries are revoked, and other users are allowed to change them. If
> the client then comes back, it's entirely possible it has incompatible
> changes to the tree, so we don't let it reconnect in order to prevent
> that. (We could make the system smarter in some situations, if for
> instance nobody has changed the given filesystem data in the
> meanwhile, but that's hard and a problem for another day.)
> 
> Apparently, ceph-fuse does exactly this, as we expect (although we
> have newly-merged features which let the admin force a reconnect). But
> the kernel client does allow a reconnect. I haven't done this myself,
> so the first question is just a fact check for Sage or Zheng:
> 1) What is the kernel client doing after suspend? Does it in fact
> reconnect under situations where ceph-fuse won't, and what are they?

It looks to me like it is making a blind attempt to reconnect via 
peer_reset(), which is probably wrong.  Haven't thought through it, 
though.

There is an ancient ticket to make the client do a best-effort reconnect 
after the MDS reconnect period, but it's a hard-to-impossible task.

For me, the minimum that we need to support well today is to make it 
clearly visible on the client whether or not we were disconnected so that 
any applications or humans using that mount can tell what happened.  
Zheng's patch for ceph-fuse that added the STALE state accomplishes this 
(by dumping mds_sessions on the ceph-fuse admin socket), and I backported 
just that patch to firefly (and dumpling? I forget).
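
A minimal sketch of how a script might check such a dump for a stale
session. It assumes the dump is JSON text (obtained, for example, with
"ceph --admin-daemon <socket> mds_sessions") and that the state shows up
somewhere as a string value equal to "stale". The exact field layout is
not given in this thread, so the function searches the whole structure
rather than relying on particular key names; stale_in_dump is a
hypothetical helper name.

```python
import json


def _contains_stale(node):
    """Recursively look for a string value equal to 'stale' (any case)."""
    if isinstance(node, str):
        return node.lower() == "stale"
    if isinstance(node, dict):
        return any(_contains_stale(value) for value in node.values())
    if isinstance(node, list):
        return any(_contains_stale(value) for value in node)
    return False


def stale_in_dump(json_text):
    """Return True if the parsed session dump mentions a stale state.

    Assumption: the dump is valid JSON and reports the state as a plain
    string; the real layout may differ.
    """
    return _contains_stale(json.loads(json_text))
```

Searching the whole structure is deliberately loose: it trades precision
for not having to guess key names that the thread does not specify.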

I think we should do the same thing for the kernel client so that you can 
look in /sys/kernel/debug/ceph/*/mdsc to get the same info.

> More interestingly, while suspended systems aren't part of our normal
> target use case, they'd be nice to support well. The trivial solution
> would be to somehow flush out all dirty data on suspend, and then on
> wake or when we discover we have a reset session, we can clean out our
> cache and reconnect as a new client if we have no dirty data.

This will at least avoid losing client data, but I think it will take 
significant work to keep the client mount alive in any meaningful way.  
Even if all of the cache contents (including dentry) are blown away, 
there are still open files that may not exist afterwards, so at a minimum 
there needs to be a way to identify and mark those deleted inode refs 
as stale at reconnect time.  Perhaps it could all be a client-side thing 
based on fresh MDS sessions and open-by-ino?

> Unfortunately, I don't know anything about Linux's suspend
> functionality or APIs, and my weak attempts at googling and grepping
> aren't turning anything up. So a question to everybody:
> 
> 2) What notifications does Linux send, and what filesystem mechanisms
> does it invoke, when it is suspending?
> I see that it has in the past forced a sync whenever suspending, but I
> think that's no longer required. Are there other interfaces we can
> rely on, or use heuristically?

There is a bunch of in-kernel infrastructure for doing sleep/wake stuff.  
For userspace, it sounds like Holger's systemd pointer is the most 
promising?

sage

* Re: Suspend and the ceph clients
  2014-05-15 14:29 ` Sage Weil
@ 2014-05-15 22:19   ` Gregory Farnum
  0 siblings, 0 replies; 5+ messages in thread
From: Gregory Farnum @ 2014-05-15 22:19 UTC
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org, Zheng Yan

On Thu, May 15, 2014 at 1:13 AM, Holger Hoffstätte
<holger.hoffstaette@googlemail.com> wrote:
> On Wed, 14 May 2014 15:07:44 -0700, Gregory Farnum wrote:
>> [..]
>> Unfortunately, I don't know anything about Linux's suspend
>> functionality or APIs, and my weak attempts at googling and grepping
>> aren't turning anything up. So a question to everybody:
>>
>> 2) What notifications does Linux send, and what filesystem mechanisms
>> does it invoke, when it is suspending?
>
> Is this what you're looking for?
> https://www.kernel.org/doc/Documentation/power/freezing-of-tasks.txt

Well, it's the kernel interface, but it's not very useful for ceph-fuse...

> User space is usually controlled by ("legacy") pm-utils, which simply
> executes a bunch of scripts (packaged and user provided) in various
> stages. It works reasonably well but is of course fragile - typical
> scripted duct tape.
>
> systemd has its own (IMHO much less fragile) way of doing things:
> http://www.freedesktop.org/software/systemd/man/systemd-sleep.conf.html

But these pointers are great. It looks like both systems let us just
drop an executable into the appropriate directory and will wait for
it to complete before continuing, so we can send ceph-fuse admin
socket messages to prepare and flush data, or whatever. Thanks!
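
As a rough illustration of the hook idea, here is a minimal sketch of an
executable written for systemd's system-sleep directory, which runs each
hook with two arguments: "pre" or "post", then the sleep type. (pm-utils
hooks instead receive a single argument such as suspend, hibernate,
resume or thaw, so they would need a different dispatch.) The socket
path and the flush_and_drop_caps command name are hypothetical; no such
ceph-fuse admin socket command is confirmed in this thread, so the
sketch only prints what it would run. Only os.sync() is really executed.

```python
#!/usr/bin/env python3
"""Sketch of a sleep hook for ceph-fuse; assumptions are flagged below."""
import os
import sys

# Hypothetical values: neither this socket path nor this command name
# comes from the thread. Substitute whatever ceph-fuse really exposes.
ASOK = "/var/run/ceph/ceph-client.admin.asok"
FLUSH_CMD = "flush_and_drop_caps"


def plan(phase, sleep_type):
    """Map the (phase, sleep_type) hook arguments to a list of steps."""
    if phase == "pre":
        # Before sleeping: push dirty data out while the cluster is reachable.
        return ["sync", "flush"]
    if phase == "post":
        # After waking: inspect the session so a stale mount is visible.
        return ["check_session"]
    return []


def admin_command(command, asok=ASOK):
    """Build the argv for an admin socket request via the ceph CLI."""
    return ["ceph", "--admin-daemon", asok, command]


def main(phase, sleep_type):
    for step in plan(phase, sleep_type):
        if step == "sync":
            os.sync()  # ask the kernel to write out dirty data
        elif step == "flush":
            print("would run:", " ".join(admin_command(FLUSH_CMD)))
        elif step == "check_session":
            print("would run:", " ".join(admin_command("mds_sessions")))
    return 0


# Only act when invoked the way a system-sleep hook is: exactly two arguments.
if __name__ == "__main__" and len(sys.argv) == 3:
    raise SystemExit(main(sys.argv[1], sys.argv[2]))
```

Installed as an executable file under /usr/lib/systemd/system-sleep/,
a script like this would run before suspend and again after resume.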

On Thu, May 15, 2014 at 7:29 AM, Sage Weil <sage@inktank.com> wrote:
> On Wed, 14 May 2014, Gregory Farnum wrote:
>> 1) What is the kernel client doing after suspend? Does it in fact
>> reconnect under situations where ceph-fuse won't, and what are they?
>
> It looks to me like it is making a blind attempt to reconnect via
> peer_reset(), which is probably wrong.  Haven't thought through it,
> though.

Zheng seemed to think this was broken as well.

> There is an ancient ticket to make the client do a best-effort reconnect
> after the MDS reconnect period, but it's a hard-to-impossible task.
>
> For me, the minimum that we need to support well today is to make it
> clearly visible on the client whether or not we were disconnected so that
> any applications or humans using that mount can tell what happened.
> Zheng's patch for ceph-fuse that added the STALE state accomplishes this
> (by dumping mds_sessions on the ceph-fuse admin socket), and I backported
> just that patch to firefly (and dumpling? I forget).
>
> I think we should do the same thing for the kernel client so that you can
> look in /sys/kernel/debug/ceph/*/mdsc to get the same info.

Yeah. http://tracker.ceph.com/issues/8368

>> More interestingly, while suspended systems aren't part of our normal
>> target use case, they'd be nice to support well. The trivial solution
>> would be to somehow flush out all dirty data on suspend, and then on
>> wake or when we discover we have a reset session, we can clean out our
>> cache and reconnect as a new client if we have no dirty data.
>
> This will at least avoid losing client data, but I think it will take
> significant work to keep the client mount alive in any meaningful way.
> Even if all of the cache contents (including dentry) are blown away,
> there are still open files that may not exist afterwards, so at a minimum
> there needs to be a way to identify and mark those deleted inode refs
> as stale at reconnect time.  Perhaps it could all be a client-side thing
> based on fresh MDS sessions and open-by-ino?

Hmm, I hadn't considered the obvious problem of held-open files, but
yes, I was definitely thinking about this as a 100% client-side thing.
We could set up socket commands to flush everything and drop caps
(this might take a while, but I guess that's what you get
anyway if you try to suspend with a lot of dirty data), and then on
resume get whatever we can on our open files. That leaves the
possibility of a third party deleting files you were working on or
something, but that pretty much seems like what you get; we can't
realistically stop it anyway.
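
The client-side flow discussed above could be sketched roughly as
follows. This is an illustration under stated assumptions, not how
either client behaves: ResumeAction, on_resume and revalidate_open_files
are hypothetical names, and lookup_by_ino stands in for whatever
open-by-ino lookup the client would really perform against the MDS.

```python
from enum import Enum


class ResumeAction(Enum):
    CONTINUE = "continue"          # session survived; nothing to do
    RECONNECT_FRESH = "reconnect"  # drop cache, come back as a new client
    MARK_STALE = "mark_stale"      # dirty data left; cannot safely rejoin


def on_resume(session_reset, has_dirty_data):
    """Pick an action after wake, following the proposal in this thread."""
    if not session_reset:
        return ResumeAction.CONTINUE
    if has_dirty_data:
        # Unflushed changes may conflict with what others did meanwhile.
        return ResumeAction.MARK_STALE
    return ResumeAction.RECONNECT_FRESH


def revalidate_open_files(open_inodes, lookup_by_ino):
    """Split held-open inodes into those still present and those gone.

    lookup_by_ino is a hypothetical callable returning True if the inode
    still exists on the cluster.
    """
    alive, stale = [], []
    for ino in open_inodes:
        (alive if lookup_by_ino(ino) else stale).append(ino)
    return alive, stale
```

The stale list is where files deleted by a third party during the
suspend would end up; the sketch only identifies them and leaves open
what the client should report to applications holding them.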

>
>> Unfortunately, I don't know anything about Linux's suspend
>> functionality or APIs, and my weak attempts at googling and grepping
>> aren't turning anything up. So a question to everybody:
>>
>> 2) What notifications does Linux send, and what filesystem mechanisms
>> does it invoke, when it is suspending?
>> I see that it has in the past forced a sync whenever suspending, but I
>> think that's no longer required. Are there other interfaces we can
>> rely on, or use heuristically?
>
> There is a bunch of in-kernel infrastructure for doing sleep/wake stuff.
> For userspace, it sounds like Holger's systemd pointer is the most
> promising?

Yeah. I do notice that in the general case, doing a suspend with USB
drives attached will cause them to fail as well, so we're at least not
on our own in handling it badly (Documentation/power/swsusp.txt and
Documentation/usb/persist.txt). Looking at, e.g.
Documentation/power/s2ram.txt actually makes me think that the "sync"
has to come from userspace, rather than being automatic, so maybe the
kernel will just have to be opportunistic about this (or we can set up
ioctls or something).

Anyway, that's enough for me to think this is feasible but not urgent,
nor trivial. Maybe a good project for an intern or something. Thanks
guys! :) http://tracker.ceph.com/issues/8369
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
