Clustered RAID1 performance


* Clustered RAID1 performance
From: Lars Marowsky-Bree @ 2013-05-30  9:59 UTC (permalink / raw)
  To: dm-devel

Hi all,

we see a significant performance hit when mirroring is used for a cLVM2
LV.

That's clearly due to the performance overhead of bouncing to user-space
(and worse, to the network) for locking etc.

I wonder if consideration has been given to how this could be improved?
Using the in-kernel DLM and holding locks for regions the local node
writes to for longer, exclusive locks while no one is reading,
parallelizing the resync ...? What is the long-term perspective for this
given the dm-raid/md raid stuff?

Before we go drafting I wanted to ask for ideas that are already
floating around ;-) Anyone working on this?

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


* Re: Clustered RAID1 performance
From: Brassow Jonathan @ 2013-05-31 15:58 UTC (permalink / raw)
  To: device-mapper development

On May 30, 2013, at 4:59 AM, Lars Marowsky-Bree wrote:

> Hi all,
> 
> we see a significant performance hit when mirroring is used for a cLVM2
> LV.
> 
> That's clearly due to the performance overhead of bouncing to user-space
> (and worse, to the network) for locking etc.
> 
> I wonder if consideration has been given to how this could be improved?
> Using the in-kernel DLM and holding locks for regions the local node
> writes to for longer, exclusive locks while no one is reading,
> parallelizing the resync ...? What is the long-term perspective for this
> given the dm-raid/md raid stuff?
> 
> Before we go drafting I wanted to ask for ideas that are already
> floating around ;-) Anyone working on this?

There isn't any active work being done in this area right now.

What is your test set-up in which you are seeing the performance hit?  In the past when I have tested with GFS2, I did see some performance degradation, but it's not what I would have called significant.  I was not testing with SSDs at the time though.  Also, people are using cluster mirrors in different ways these days.  They may have the mirror active on multiple hosts concurrently, but they only really use it from one host.  Clearly in that case, the ideas you mentioned could make a difference.

I have given some thought to making MD RAID1 cluster-aware.  (RAID10 would come for free, but RAID4/5/6 would be excluded.)  Device-mapper would then make use of this code via the dm-raid.c wrapper.  My idea for the new implementation would have been to keep a separate bitmap area for each machine.  This way, there would be no locking and no need to keep the log state collectively in-sync during nominal operation.  When machines come, go or fail, their bitmaps would have to be merged and responsibility for recovery/initialization/scrubbing would have to be decided.

Additionally, handling device failures is more tricky in MD RAID.  This is because MD RAID (and by extension, the device-mapper targets that leverage it) simply marks a device as failed in the superblock and keeps working while DM "mirror" blocks I/O until the failed device is cleared.  This makes a difference in the cluster because one machine may suffer a device failure due to connectivity and another machine may not.  If the machine suffering the failure simply marks the failure in the superblock (which will also need to be coordinated) and proceeds, the other machine may then attempt a read from the device and grab a copy of data that is stale.
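
[Editor's illustration] The per-machine bitmap idea described above can be pictured with a small in-memory Python model. This is only a sketch under stated assumptions: the names `NodeBitmap`, `ClusterMirror` and `node_failed` are hypothetical and do not come from MD or device-mapper, and a real implementation would keep the bitmaps on disk inside the kernel. It shows writes touching only the local node's bitmap (no cluster lock), and a survivor inheriting a dead node's dirty regions.

```python
from dataclasses import dataclass, field


@dataclass
class NodeBitmap:
    """Hypothetical per-node write-intent bitmap, modelled as a set of dirty regions."""
    dirty: set = field(default_factory=set)

    def mark_dirty(self, region: int) -> None:
        # Recorded before the write is issued; only the local bitmap is touched.
        self.dirty.add(region)

    def clear(self, region: int) -> None:
        # Cleared once the write has reached every mirror leg.
        self.dirty.discard(region)


class ClusterMirror:
    """Toy model: one bitmap per node, merged only when membership changes."""

    def __init__(self, nodes):
        self.bitmaps = {name: NodeBitmap() for name in nodes}

    def write(self, node: str, region: int, complete: bool = True) -> None:
        bitmap = self.bitmaps[node]
        bitmap.mark_dirty(region)
        if complete:
            bitmap.clear(region)

    def node_failed(self, dead: str, survivor: str) -> list:
        # Merge the dead node's bitmap into the survivor, which becomes
        # responsible for resyncing those regions (assumed policy).
        orphaned = self.bitmaps.pop(dead).dirty
        self.bitmaps[survivor].dirty |= orphaned
        return sorted(orphaned)


mirror = ClusterMirror(["node-a", "node-b"])
mirror.write("node-a", 3)                   # completes, bitmap ends up clean
mirror.write("node-b", 7, complete=False)   # node-b dies mid-write
to_resync = mirror.node_failed("node-b", "node-a")
print(to_resync)
```

How the survivor is chosen, and how the merge itself is made crash-safe, is exactly the "responsibility ... would have to be decided" part and is not modelled here.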

So, there are some things to think through, but nothing insurmountable.

thanks for your interest,
 brassow


* Re: Clustered RAID1 performance
From: Lars Marowsky-Bree @ 2013-06-03 20:45 UTC (permalink / raw)
  To: dm-devel

On 2013-05-31T10:58:35, Brassow Jonathan <jbrassow@redhat.com> wrote:

Hi Jon, thanks for the response.

> What is your test set-up in which you are seeing the performance hit?

We saw it essentially as soon as we used "lvcreate -m1" on an LV; the
performance drop would be about 70% compared to -m0. Exclusive
activation would bring it back to ~95%.

(I wonder if that could be related to still using an older lvm2
user-space. Hm. Need to benchmark again with the latest.)

Actually concurrency via OCFS2 would of course have a worse impact.

I'm not quite sure where the split is; many use cases are "One active
process, but they want the LV to be visible everywhere and the election
to be automatic", but a fair share is "concurrent processes/OCFS2/GFS2,
replacing traditional expensive SAN mirroring (while not being ready
for ceph/gluster)" too.

The first part is reasonably trivial - enforce the one open process via
an exclusive lock automatically, and get to avoid the entire rest of the
cluster overhead.
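
[Editor's illustration] A minimal sketch of the "one opener, enforced by an exclusive lock" idea, assuming a cluster-wide try-lock is available. The `ExclusiveOpenLock` class below is a hypothetical in-memory stand-in for such a lock (in practice it would presumably be a DLM lock); it is not an existing LVM or DLM interface.

```python
class ExclusiveOpenLock:
    """Hypothetical stand-in for a cluster-wide exclusive lock on one LV."""

    def __init__(self):
        self.holder = None

    def open(self, node: str) -> bool:
        # The first opener takes the lock; any other node is refused
        # until the holder closes the device again.
        if self.holder is None or self.holder == node:
            self.holder = node
            return True
        return False

    def close(self, node: str) -> None:
        if self.holder == node:
            self.holder = None


lock = ExclusiveOpenLock()
first = lock.open("node-a")     # granted: node-a can run without cluster log traffic
second = lock.open("node-b")    # refused while node-a holds the LV
lock.close("node-a")
third = lock.open("node-b")     # granted after node-a has closed
print(first, second, third)
```

While a single node holds the lock, it is the only writer, so the mirror log would not need to be coordinated across the cluster.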

Optimizing the latter case is harder.

> I have given some thought to making MD RAID1 cluster-aware.  (RAID10
> would come for free, but RAID4/5/6 would be excluded.) 

Yes, I guess that covers 99% of the interesting pieces anyway.

> would then make use of this code via the dm-raid.c wrapper.  My idea
> for the new implementation would have been to keep a separate bitmap
> area for each machine.  This way, there would be no locking and no
> need to keep the log state collectively in-sync during nominal
> operation.

Yes, I can see this could work. (Similarly to how OCFS2/GFS2 keep a
per-node journal too.)

Could be a bit tricky to get to the point where you could do full
read-balancing.

And, of course, this whole complexity is only needed for the truly
concurrent IO.

> When machines come, go or fail, their bitmaps would have
> to be merged and responsibility for recovery/initialization/scrubbing
> would have to be decided.

Right, but that's easy enough.

> Additionally, handling device failures is
> more tricky in MD RAID.  This is because MD RAID (and by extension,
> the device-mapper targets that leverage it) simply marks a device as
> failed in the superblock and keeps working while DM "mirror" blocks
> I/O until the failed device is cleared.

I guess though that could be enhanced.

> This makes a difference in the cluster because one machine may suffer
> a device failure due to connectivity and another machine may not.

Yes, understood. And we then want to degrade as gracefully as we can,
too.

(I sometimes keep wondering if, depending on the interconnect in the
cluster, the "ship all writes to a central process, read locally,
fail-over that process" isn't cheaper. I'm pretty sure that for
read-intensive workloads, it probably is.)

> If the machine suffering the failure simply marks the failure in the
> superblock (which will also need to be coordinated) and proceeds, the
> other machine may then attempt a read from the device and grab a copy
> of data that is stale.

Right. Cheapest way is to mark the drive failed on all nodes at the same
time; anything else actually does require write-shipping.
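
[Editor's illustration] The stale-read scenario and the "mark failed on all nodes" remedy can be shown with a toy Python model. Everything here is an assumption for illustration: `Leg`, `Node` and `run` are hypothetical names, the legs are plain dictionaries, and the "coordination" step simply copies the failed set from one node to the other.

```python
class Leg:
    """One mirror leg, modelled as region -> data."""

    def __init__(self):
        self.data = {}


class Node:
    def __init__(self, legs):
        self.legs = legs
        self.failed = set()        # legs this node has marked failed
        self.unreachable = set()   # legs this node cannot reach

    def write(self, region, value):
        for index, leg in enumerate(self.legs):
            if index in self.failed:
                continue
            if index in self.unreachable:
                # MD-style behaviour: mark the leg failed and keep going.
                self.failed.add(index)
                continue
            leg.data[region] = value

    def read(self, region, prefer):
        order = [prefer] + [i for i in range(len(self.legs)) if i != prefer]
        for index in order:
            if index not in self.failed and index not in self.unreachable:
                return self.legs[index].data.get(region)
        raise OSError("no usable mirror leg")


def run(coordinated: bool) -> str:
    legs = [Leg(), Leg()]
    a, b = Node(legs), Node(legs)
    b.write(0, "old")            # both legs hold "old"
    a.unreachable.add(1)         # node a loses its path to leg 1
    a.write(0, "new")            # only leg 0 receives "new"
    if coordinated:
        b.failed |= a.failed     # failure is marked on all nodes
    return b.read(0, prefer=1)   # node b happens to read from leg 1


print(run(coordinated=False))
print(run(coordinated=True))
```

Without coordination, node b still trusts leg 1 and returns the old data; once the failure is propagated, it falls back to leg 0 and sees the new write.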

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
