* rbd locking and handling broken clients
@ 2012-06-13 17:40 Gregory Farnum
2012-06-13 20:37 ` Florian Haas
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: Gregory Farnum @ 2012-06-13 17:40 UTC (permalink / raw)
To: ceph-devel; +Cc: mandell
We've had some user reports lately on rbd images being broken by
misbehaving clients — namely, rbd image I is mounted on computer A,
computer A starts misbehaving, and so I is mounted on computer B. But
because A is misbehaving it keeps writing to the image, corrupting it
horribly.
To handle this, we're working on two separate but related features:
1) Advisory RBD image locking. See
http://tracker.newdream.net/issues/1480 and the wip-rbd-locking
branch. With this addition clients gain the ability to do shared and
exclusive locking of images, which protects against accidentally
mounting a disk in two places at once — but because the images are
distributed across all OSDs this is of course entirely advisory, and a
misbehaving client is still perfectly capable of writing to a disk it
shouldn't. To handle that, we're also looking at...
2) Client fencing. See http://tracker.newdream.net/issues/2531. There
is an existing "blacklist" functionality in the OSDs/OSDMap, where you
can specify an "entity_addr_t" (consisting of an IP, a port, and a
nonce — so essentially unique per-process) which is not allowed to
communicate with the cluster any longer. The problem is that the
blacklist is distributed as part of the OSDMap (via gossip), so when a
point-in-time transition matters (as with an rbd image), the new client
needs every OSD to have updated its map before it starts doing any
reads or writes.
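To make the point-in-time problem concrete, here is a minimal toy model in Python. It is only a sketch: the class names (`OsdMap`, `Osd`), the `safe_to_start` helper, and the address string format are illustrative assumptions, not Ceph code. It shows that an OSD still holding the older map keeps accepting the blacklisted client, so the new client should wait until every OSD has reached the epoch that carries the blacklist entry.

```python
from dataclasses import dataclass


@dataclass
class OsdMap:
    """Toy stand-in for an OSDMap: an epoch number plus a blacklist."""
    epoch: int
    blacklist: frozenset = frozenset()


@dataclass
class Osd:
    """Toy OSD that only knows the map it has received so far."""
    osd_map: OsdMap

    def accepts(self, client_addr: str) -> bool:
        # An OSD can only enforce the blacklist it has actually seen.
        return client_addr not in self.osd_map.blacklist


def safe_to_start(osds, fence_epoch: int) -> bool:
    """True once every OSD holds a map at least as new as the fence epoch."""
    return all(osd.osd_map.epoch >= fence_epoch for osd in osds)


# Hypothetical client address, written as ip:port/nonce for illustration.
BAD_CLIENT = "10.0.0.5:0/1234"

old_map = OsdMap(epoch=10)
new_map = OsdMap(epoch=11, blacklist=frozenset({BAD_CLIENT}))

# One OSD has the new map, one is still on the old epoch.
osds = [Osd(new_map), Osd(old_map)]
print(safe_to_start(osds, fence_epoch=11))
print(osds[1].accepts(BAD_CLIENT))
```

Until the lagging OSD catches up, the fence is incomplete, which is why the blacklist alone does not give a clean handover.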
The initial idea in the bug was to have some sort of command you could
run on a per-image basis, which breaks the locks and does the
blacklist for the old locker — but if the problem is a misbehaving
hypervisor, then you may have to run that for several hundred images,
where each command needs to talk to several hundred OSDs. That's
super-lame and nobody wants to do it. The alternative is making an
admin/script do it on their own using the existing "ceph osd
blacklist" and about-to-exist "rbd lock break" functionality as
appropriate. That's also super-lame, because then they have to come up
with some way of spreading the map, and it's difficult to embed in
external libraries. So what I'm currently (as of 15 seconds ago)
leaning towards is a new rados command which will do the blacklist and
make sure the new map is distributed to each OSD ("rados
blacklist_and_spread 'address'"), and then requiring the automatic
system to:
a) Run the blacklist command,
b) individually break the locks necessary,
c) remount the image(s) elsewhere.
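The three steps above can be sketched as a small Python function that only builds the ordered command list and runs nothing. Everything here is an assumption for illustration: "rados blacklist_and_spread" is the command proposed in this message and does not exist yet, the argument form of "rbd lock break" is guessed, and "remount-on" is a made-up placeholder for whatever the automatic system uses to bring an image up elsewhere.

```python
def fencing_plan(bad_client_addr, locked_images, new_host):
    """Return the ordered command lines for fencing one misbehaving client.

    Nothing is executed; the caller decides how (or whether) to run them.
    """
    # a) Blacklist the client once and wait for the map to reach every OSD.
    #    "blacklist_and_spread" is the *proposed* command, not an existing one.
    plan = [["rados", "blacklist_and_spread", bad_client_addr]]

    # b) Break the lock on each affected image (argument syntax is assumed).
    for image in locked_images:
        plan.append(["rbd", "lock", "break", image])

    # c) Remount each image elsewhere ("remount-on" is a hypothetical helper).
    for image in locked_images:
        plan.append(["remount-on", new_host, image])

    return plan


for command in fencing_plan("10.0.0.5:0/1234", ["vm-disk-1", "vm-disk-2"], "hostB"):
    print(" ".join(command))
```

The point of this ordering is that the expensive cluster-wide step happens once per broken client rather than once per image, and no lock is broken before the old client has been cut off.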
Are there any thoughts on these plans? Do they satisfy your needs in
this area, or are there holes you can think of?
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: rbd locking and handling broken clients
2012-06-13 17:40 rbd locking and handling broken clients Gregory Farnum
@ 2012-06-13 20:37 ` Florian Haas
2012-06-13 23:41 ` Greg Farnum
2012-06-13 23:14 ` Tommi Virtanen
` (2 subsequent siblings)
3 siblings, 1 reply; 8+ messages in thread
From: Florian Haas @ 2012-06-13 20:37 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, mandell
Greg,
My understanding of Ceph code internals is far too limited to comment on
your specific points, but allow me to ask a naive question.
Couldn't you be stealing a lot of ideas from SCSI-3 Persistent
Reservations? If you had server-side (OSD) persistence of information of
the "this device is in use by X" type (where anything other than X would
get an I/O error when attempting to access data), and you had a manual,
authenticated override akin to SCSI PR preemption, plus key
registration/exchange for that authentication, then you would at least
have to have the combination of a misbehaving OSD plus a malicious
client for data corruption. A non-malicious but just broken client
probably won't do.
Clearly I may be totally misguided, as Ceph is fundamentally
decentralized and SCSI isn't, but if PR-ish behavior comes even close to
what you're looking for, grabbing those ideas would look better to me
than designing your own wheel.
Just my $.02, of course.
Cheers,
Florian
* Re: rbd locking and handling broken clients
2012-06-13 20:37 ` Florian Haas
@ 2012-06-13 23:41 ` Greg Farnum
2012-06-14 10:37 ` Florian Haas
2012-06-14 18:06 ` Tommi Virtanen
0 siblings, 2 replies; 8+ messages in thread
From: Greg Farnum @ 2012-06-13 23:41 UTC (permalink / raw)
To: Florian Haas; +Cc: ceph-devel, mandell, Tommi Virtanen
On Wednesday, June 13, 2012 at 1:37 PM, Florian Haas wrote:
> Greg,
>
> My understanding of Ceph code internals is far too limited to comment on
> your specific points, but allow me to ask a naive question.
>
> Couldn't you be stealing a lot of ideas from SCSI-3 Persistent
> Reservations? If you had server-side (OSD) persistence of information of
> the "this device is in use by X" type (where anything other than X would
> get an I/O error when attempting to access data), and you had a manual,
> authenticated override akin to SCSI PR preemption, plus key
> registration/exchange for that authentication, then you would at least
> have to have the combination of a misbehaving OSD plus a malicious
> client for data corruption. A non-malicious but just broken client
> probably won't do.
>
> Clearly I may be totally misguided, as Ceph is fundamentally
> decentralized and SCSI isn't, but if PR-ish behavior comes even close to
> what you're looking for, grabbing those ideas would look better to me
> than designing your own wheel.
Yeah, the problem here is exactly that Ceph (and RBD) are fundamentally decentralized. :) I'm not familiar with the SCSI PR mechanism either, but it looks to me like it deals in entirely local information — the equivalent with RBD would require performing a locking operation on every object in the RBD image before you accessed it. We could do that, but then opening an image would take time linear in its size… :(
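A rough back-of-the-envelope sketch of why per-object locking makes open time linear in image size. It assumes 4 MiB objects and a fully populated image; both are assumptions for illustration, and the function name is made up.

```python
def objects_in_image(image_bytes, object_bytes=4 * 1024**2):
    """Number of objects needed to hold image_bytes, rounding up.

    object_bytes defaults to 4 MiB, which is an assumption made for this
    example rather than something stated in the thread.
    """
    return -(-image_bytes // object_bytes)  # ceiling division


# One lock operation per object: a 1 TiB image needs this many of them.
print(objects_in_image(1024**4))  # prints 262144
```

Doubling the image size doubles the number of lock operations needed before the first read or write can go ahead.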
On Wednesday, June 13, 2012 at 4:14 PM, Tommi Virtanen wrote:
> On Wed, Jun 13, 2012 at 10:40 AM, Gregory Farnum <greg@inktank.com> wrote:
> > 2) Client fencing. See http://tracker.newdream.net/issues/2531. There
> > is an existing "blacklist" functionality in the OSDs/OSDMap, where you
> > can specify an "entity_addr_t" (consisting of an IP, a port, and a
> > nonce — so essentially unique per-process) which is not allowed to
> > communicate with the cluster any longer. The problem with this is that
>
> Does that work even after a TCP connection close & re-establish, where
> the client now has a new source port address? (Perhaps the port is 0
> for clients?)
Precisely — client ports are 0 since they never accept incoming connections.
> You know, I'd be really happy if this could be achieved by means of
> removing cephx keys.
Unfortunately, that wouldn't really solve the problem without dramatically decreasing the rotation interval for cluster access keys which cephx shares. Alternative (entirely theoretical) security schemes might, but they're well beyond what's feasible for us to work on any time soon...
* Re: rbd locking and handling broken clients
2012-06-13 23:41 ` Greg Farnum
@ 2012-06-14 10:37 ` Florian Haas
2012-06-14 18:06 ` Tommi Virtanen
1 sibling, 0 replies; 8+ messages in thread
From: Florian Haas @ 2012-06-14 10:37 UTC (permalink / raw)
To: Greg Farnum; +Cc: ceph-devel, mandell, Tommi Virtanen
On Thu, Jun 14, 2012 at 1:41 AM, Greg Farnum <greg@inktank.com> wrote:
> On Wednesday, June 13, 2012 at 1:37 PM, Florian Haas wrote:
>> Greg,
>>
>> My understanding of Ceph code internals is far too limited to comment on
>> your specific points, but allow me to ask a naive question.
>>
>> Couldn't you be stealing a lot of ideas from SCSI-3 Persistent
>> Reservations? If you had server-side (OSD) persistence of information of
>> the "this device is in use by X" type (where anything other than X would
>> get an I/O error when attempting to access data), and you had a manual,
>> authenticated override akin to SCSI PR preemption, plus key
>> registration/exchange for that authentication, then you would at least
>> have to have the combination of a misbehaving OSD plus a malicious
>> client for data corruption. A non-malicious but just broken client
>> probably won't do.
>>
>> Clearly I may be totally misguided, as Ceph is fundamentally
>> decentralized and SCSI isn't, but if PR-ish behavior comes even close to
>> what you're looking for, grabbing those ideas would look better to me
>> than designing your own wheel.
>
> Yeah, the problem here is exactly that Ceph (and RBD) are fundamentally decentralized. :)
True, but as a general comment I do posit that to say "X is not
exactly like Y, thus nothing applicable to X applies to Y" is a
fallacy. :)
> I'm not familiar with the SCSI PR mechanism either, but it looks to me like it deals in entirely local information — the equivalent with RBD would require performing a locking operation on every object in the RBD image before you accessed it. We could do that, but then opening an image would take time linear in its size… :(
Well you would make this configurable and optional, wouldn't you? Kind
of like no-one forces people to use PRs on SCSI LUs. When this is
being used, however, taking a performance hit on open sounds like a
reasonable price to pay for not shredding data. TANSTAAFL.
Again, this is just my poorly informed two cents. :)
Cheers,
Florian
* Re: rbd locking and handling broken clients
2012-06-13 23:41 ` Greg Farnum
2012-06-14 10:37 ` Florian Haas
@ 2012-06-14 18:06 ` Tommi Virtanen
1 sibling, 0 replies; 8+ messages in thread
From: Tommi Virtanen @ 2012-06-14 18:06 UTC (permalink / raw)
To: Greg Farnum; +Cc: Florian Haas, ceph-devel, mandell
On Wed, Jun 13, 2012 at 4:41 PM, Greg Farnum <greg@inktank.com> wrote:
>> You know, I'd be really happy if this could be achieved by means of
>> removing cephx keys.
> Unfortunately, that wouldn't really solve the problem without dramatically decreasing the rotation interval for cluster access keys which cephx shares. Alternative (entirely theoretical) security schemes might, but they're well beyond what's feasible for us to work on any time soon...
I wouldn't want to rely on timed rotation. Fencing triggering a
rotation on demand, then again.. that I do like.
* Re: rbd locking and handling broken clients
2012-06-13 17:40 rbd locking and handling broken clients Gregory Farnum
2012-06-13 20:37 ` Florian Haas
@ 2012-06-13 23:14 ` Tommi Virtanen
2012-06-13 23:19 ` Tommi Virtanen
2012-06-14 19:44 ` Tommi Virtanen
3 siblings, 0 replies; 8+ messages in thread
From: Tommi Virtanen @ 2012-06-13 23:14 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, mandell
On Wed, Jun 13, 2012 at 10:40 AM, Gregory Farnum <greg@inktank.com> wrote:
> 2) Client fencing. See http://tracker.newdream.net/issues/2531. There
> is an existing "blacklist" functionality in the OSDs/OSDMap, where you
> can specify an "entity_addr_t" (consisting of an IP, a port, and a
> nonce — so essentially unique per-process) which is not allowed to
> communicate with the cluster any longer. The problem with this is that
Does that work even after a TCP connection close & re-establish, where
the client now has a new source port address? (Perhaps the port is 0
for clients?)
* Re: rbd locking and handling broken clients
2012-06-13 17:40 rbd locking and handling broken clients Gregory Farnum
2012-06-13 20:37 ` Florian Haas
2012-06-13 23:14 ` Tommi Virtanen
@ 2012-06-13 23:19 ` Tommi Virtanen
2012-06-14 19:44 ` Tommi Virtanen
3 siblings, 0 replies; 8+ messages in thread
From: Tommi Virtanen @ 2012-06-13 23:19 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, mandell
On Wed, Jun 13, 2012 at 10:40 AM, Gregory Farnum <greg@inktank.com> wrote:
> 2) Client fencing. See http://tracker.newdream.net/issues/2531. There
You know, I'd be really happy if this could be achieved by means of
removing cephx keys.
* Re: rbd locking and handling broken clients
2012-06-13 17:40 rbd locking and handling broken clients Gregory Farnum
` (2 preceding siblings ...)
2012-06-13 23:19 ` Tommi Virtanen
@ 2012-06-14 19:44 ` Tommi Virtanen
3 siblings, 0 replies; 8+ messages in thread
From: Tommi Virtanen @ 2012-06-14 19:44 UTC (permalink / raw)
To: Gregory Farnum; +Cc: ceph-devel, mandell
On Wed, Jun 13, 2012 at 10:40 AM, Gregory Farnum <greg@inktank.com> wrote:
> 2) Client fencing. See http://tracker.newdream.net/issues/2531. There
> is an existing "blacklist" functionality in the OSDs/OSDMap, where you
So I just managed to put into words another reason I like the key
rotation more than blacklisting: blacklisting fails open, key rotation
fails closed. That is, say something restarts the client process, and
it gets a new pid: now it has a new unique id, and the old blacklist
entry no longer applies! Whereas with key rotation, if you don't get
a new secret, you have a snowball's chance in hell of getting it going
again.
The other reason that came up is, blacklisting is time-expiring (I
hear 24 hours currently), and I have absolutely no faith that
malfunctioning clients will actually always get manual intervention by
an admin within that time interval (or any other reasonable time
interval, either).
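The "fails open" versus "fails closed" distinction can be illustrated with a toy Python model. This is a sketch under loud assumptions: `EntityAddr`, `Blacklist`, and `KeyCheck` are made-up classes, the 24-hour expiry is the figure quoted above ("I hear 24 hours currently"), and the shared-secret check is a gross simplification that does not reflect how cephx actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityAddr:
    """Toy version of the (IP, port, nonce) triple described in the thread."""
    ip: str
    port: int
    nonce: int


class Blacklist:
    """Deny-list keyed by exact address, with expiring entries."""

    def __init__(self):
        self._expires = {}

    def add(self, addr, now, ttl=24 * 3600):
        self._expires[addr] = now + ttl

    def allows(self, addr, now):
        expiry = self._expires.get(addr)
        # Unknown addresses and expired entries are both let through.
        return expiry is None or now >= expiry


class KeyCheck:
    """Allow only clients presenting the current secret (simplified)."""

    def __init__(self, secret):
        self.secret = secret

    def rotate(self, new_secret):
        self.secret = new_secret

    def allows(self, presented):
        return presented == self.secret


blacklist = Blacklist()
old_process = EntityAddr("10.0.0.5", 0, 1234)
restarted = EntityAddr("10.0.0.5", 0, 5678)  # same host, new nonce
blacklist.add(old_process, now=0)

keys = KeyCheck("secret-v1")
keys.rotate("secret-v2")

print(blacklist.allows(old_process, now=60))         # blocked while entry is live
print(blacklist.allows(restarted, now=60))           # new nonce slips through
print(blacklist.allows(old_process, now=24 * 3600))  # entry has expired
print(keys.allows("secret-v1"))                      # stale secret is refused
```

In this model the blacklist only denies what it has been told about, for as long as the entry lasts, while the key check denies everything that cannot present the current secret.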
Thread overview: 8+ messages
2012-06-13 17:40 rbd locking and handling broken clients Gregory Farnum
2012-06-13 20:37 ` Florian Haas
2012-06-13 23:41 ` Greg Farnum
2012-06-14 10:37 ` Florian Haas
2012-06-14 18:06 ` Tommi Virtanen
2012-06-13 23:14 ` Tommi Virtanen
2012-06-13 23:19 ` Tommi Virtanen
2012-06-14 19:44 ` Tommi Virtanen