From: joystick <joystick@shiftmail.org>
To: nab@risingtidesystems.com
Cc: Peter Grandi <pg@lxra2.for.sabi.co.UK>,
	Linux RAID <linux-raid@vger.kernel.org>,
	target-devel@vger.kernel.org
Subject: Re: Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI)
Date: Wed, 19 Sep 2012 12:49:19 +0200
Message-ID: <5059A32F.8030606@shiftmail.org>
In-Reply-To: <1348006818.25356.62.camel@haakon2.linux-iscsi.org>

On 09/19/12 00:20, Nicholas A. Bellinger wrote:
>
>>> Are you enabling emulate_write_cache=1 with your iblock
>>> backends..? This can have a gigantic effect on initiator
>>> performance for both MSFT + Linux SCSI clients.
>> That sounds interesting, but also potentially rather dangerous,
>> unless there is a very reliable implementation of IO barriers.
>> Just like with enabling write caches on real disks...
>>
> Not exactly.  The name of the 'emulate_write_cache' device attribute
> is a bit misleading here.  This bit simply reports to the SCSI client
> that WCE=1 when the mode sense caching page is read during the
> initial LUN scan.

Then can I say that the default is wrong?
You are declaring as write-through a device that is almost certainly 
write-back (at the very least, the HDDs underneath will have caches).

If power is lost at the iSCSI target, there WILL be data loss. People do 
not expect that. Change the default!
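
(For the record, this is how I believe the attribute can be flipped per 
backstore through configfs; the "iblock_0/my_disk" path components are 
placeholders for whatever names your setup uses:

  # report WCE=1 to initiators for an existing iblock backstore
  echo 1 > /sys/kernel/config/target/core/iblock_0/my_disk/attrib/emulate_write_cache

  # then verify from a Linux initiator, e.g. with sdparm
  sdparm --get=WCE /dev/sdX

Initiators will probably only notice after the next LUN scan, i.e. 
after a logout/login.)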


Besides this, I don't understand how declaring an iSCSI target as 
write-through could cause initiators to voluntarily slow down their 
operations. That would be a bug in the initiators, because write-through 
is "better" than write-back for all purposes: initiators should simply 
skip the queue drain / flush / FUA, and everything else should stay the 
same.
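
On a Linux initiator this is easy to check: the kernel's idea of the 
target's cache mode sits in sysfs, and "write back" is what makes the 
block layer send flushes/FUA at all (the H:C:T:L address below is a 
placeholder):

  # what the initiator believes about the target's cache, as reported
  # by the WCE bit at scan time: "write back" or "write through"
  cat /sys/class/scsi_disk/4:0:0:0/cache_type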


>>> [ ... ] check your [ ... ]/queue/max*sectors_kb for the MD
>>> RAID to make sure the WRITEs are stripe-aligned to get best
>>> performance with software MD raid.
>> That does not quite ensure that the writes are stripe aligned,
>> but perhaps a larger stripe cache would help.
>>
> I'm talking about what MD raid has chosen as its underlying
> max_sectors_kb to issue I/O to the underlying raid member devices.
> Depending on what backend storage hardware is in use, this may end up
> as '127', which results in ugly mis-aligned writes that end up killing
> performance.

Interesting observation.
For local writers, MD probably waits long enough for other requests to 
arrive and fill a stripe before initiating a read-modify-write (RMW); 
but maybe iSCSI is too slow for that, and MD initiates an RMW for each 
request, which would mean a zillion RMWs.
Can that be? Does anyone know MD well enough to say whether it waits a 
little for more data, in an attempt to fill an entire stripe, before 
proceeding with the RMW? If so, can such a timeout be tuned?
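
The only knobs I know of are size-based, not time-based; still, a quick 
experiment along these lines (md device name is a placeholder) would at 
least show whether a bigger stripe cache reduces the RMWs:

  # stripe cache size, in stripes (default 256); more room means more
  # chance to collect a full stripe before an RMW becomes necessary
  cat /sys/block/md0/md/stripe_cache_size
  echo 4096 > /sys/block/md0/md/stripe_cache_size

  # how many times a stripe needing preread (i.e. an RMW) may be
  # bypassed in favour of full-stripe writes before it is serviced
  cat /sys/block/md0/md/preread_bypass_threshold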

> We've (RTS) changed this with a one-liner patch to raid456.c code on
> .32-based distro kernels in the past to get proper stripe-aligned
> writes, and it obviously makes a huge difference with fast storage
> hardware.

This value is writable via sysfs; why do you need a patch?
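
I.e. something like the following on the member devices, rather than a 
patch (the value is illustrative; it should be chunk-sized or larger, 
and the kernel silently caps it at max_hw_sectors_kb):

  # raise the per-member request size limit so MD's I/Os to each leg
  # are not split below the chunk size
  for q in /sys/block/md0/slaves/*/queue/max_sectors_kb; do
      echo 512 > "$q"
  done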

> That's exactly what I'm talking about.
>
> With buffered FILEIO enabled, an incoming WRITE payload will already
> have been ACKed back to the SCSI fabric and up the storage ->
> filesystem stack, but if a power loss were to occur before that data
> has been written out (in the absence of a battery back-up unit, for
> example), then the FS on the client will have (silently) lost data.
>
> This is why we removed buffered FILEIO from mainline in the first
> place, but in retrospect, if people understand the consequences and
> still want to use buffered FILEIO for performance reasons, they
> should be able to do so.


If you declare the target as write-back and implement flush+FUA, no 
data loss should occur AFAIU, isn't that so?

AFAIR, hard disks normally declare all operations complete immediately 
after submission (while in reality the data is still in the cache), but 
if you issue a flush+FUA they make an exception to this rule and make 
sure that this operation, and all previously submitted operations, are 
indeed on the platter before returning. Do I remember correctly?

Can you do the same for buffered FILEIO?
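
As a userspace analogy of the mapping I have in mind (the backing file 
path is hypothetical): a buffered write returns as soon as the page 
cache has the data, and durability comes only from an explicit flush, 
which is exactly the fdatasync()/O_DSYNC pair a buffered FILEIO backend 
could translate SYNCHRONIZE CACHE and FUA writes into:

  # buffered: returns fast, data possibly still only in the page cache
  dd if=/dev/zero of=/srv/backing.img bs=1M count=64 conv=notrunc

  # buffered plus one flush at the end (the SYNCHRONIZE CACHE analogue)
  dd if=/dev/zero of=/srv/backing.img bs=1M count=64 conv=notrunc,fdatasync

  # every write synchronous before dd proceeds (the FUA-like case)
  dd if=/dev/zero of=/srv/backing.img bs=1M count=64 conv=notrunc oflag=dsync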

Thread overview: 13+ messages
2012-09-18 14:37 Serious performance issues with mdadm RAID-5 partition exported through LIO (iSCSI) Ferry
2012-09-18 16:48 ` Chris Murphy
2012-09-18 19:49 ` Nicholas A. Bellinger
2012-09-18 21:18   ` Peter Grandi
2012-09-18 22:20     ` Nicholas A. Bellinger
2012-09-19 10:49       ` joystick [this message]
2012-09-23  1:01         ` Nicholas A. Bellinger
2012-09-19 14:19     ` freaky
2012-09-19 17:20       ` Chris Murphy
2012-09-19  6:44   ` Arne Redlich
2012-09-19 14:27     ` Ferry
2012-09-18 20:06 ` Peter Grandi
2012-09-19 13:08   ` freaky
