From: Wols Lists <antlists@youngman.org.uk>
To: David Brown <david.brown@hesbynett.no>, Shaohua Li <shli@kernel.org>
Cc: linux-raid@vger.kernel.org, jes.sorensen@gmail.com, neilb@suse.de
Subject: Re: RAID creation resync behaviors
Date: Thu, 4 May 2017 17:02:05 +0100
Message-ID: <590B507D.6050609@youngman.org.uk>
In-Reply-To: <590ADA3F.8070909@hesbynett.no>

On 04/05/17 08:37, David Brown wrote:
> On 04/05/17 03:54, Shaohua Li wrote:
>> > On Wed, May 03, 2017 at 11:06:01PM +0200, David Brown wrote:
>>> >> On 03/05/17 22:27, Shaohua Li wrote:
>>>> >>> Hi,
>>>> >>>
>>>> >>> Currently we have different resync behaviors in array creation.
>>>> >>>
>>>> >>> - raid1: copy data from disk 0 to disk 1 (overwrite)
>>>> >>> - raid10: read both disks, compare and write if there is a difference (compare-write)
>>>> >>> - raid4/5: read first n-1 disks, calculate parity and then write parity to the last disk (overwrite)
>>>> >>> - raid6: read all disks, calculate parity and compare, and write if there is a difference (compare-write)
>>>> >>>
>>>> >>> Writing the whole disk is very unfriendly to SSDs, because it reduces their
>>>> >>> lifetime. And if the user has already done a trim before creation, the
>>>> >>> unnecessary writes could make the SSD slower in the future. Could we prefer
>>>> >>> compare-write to overwrite if mdadm detects that the disks are SSDs? Admittedly
>>>> >>> compare-write is sometimes slower than overwrite, so maybe add a new option
>>>> >>> to mdadm. An option to let mdadm trim the SSDs before creation sounds
>>>> >>> reasonable too.
>>>> >>>
>>> >>
>>> >> When doing the first sync, md tracks how far its sync has got, keeping a
>>> >> record in the metadata in case it has to be restarted (such as due to a
>>> >> reboot while syncing). Why not simply /not/ sync stripes until you first
>>> >> write to them? It may be that a counter of synced stripes is not enough,
>>> >> and you need a bitmap (like the write intent bitmap), but it would reduce
>>> >> the creation sync time to 0 and avoid any writes at all.
>> >
>> > For raid 4/5/6, this means we must always do a full stripe write for any normal
>> > write that hits a range not yet synced. This would harm the performance of
>> > normal writes.
> Agreed. The unused sectors could be set to 0, rather than read from the
> disks - that would reduce the latency and be friendly to high-end SSDs
> with compression (zero blocks compress quite well!).
>
>> > For raid1/10, this sounds more appealing. But each bit in the bitmap will
>> > stand for a range, so if only part of the range is written by normal IO, we
>> > have two choices. Either sync the range immediately and clear the bit, in
>> > which case the sync will impact normal IO. Or don't do the sync immediately,
>> > but since the bit is set (which means the range isn't synced), read IO can
>> > only access the first disk, which is harmful too.
>> >
> This could be done in a more sophisticated manner. (Yes, I appreciate
> that "sophisticated" or "complex" are a serious disadvantage - I'm just
> throwing up ideas that could be considered.)
>
> Divide the array into "sync blocks", each covering a range of stripes,
> with a bitmap of three states - unused, partially synced, fully synced.
> All blocks start off unused. If a write is made to a previously unused
> block, that block becomes partially synced, and the write has to be done
> as a full stripe write. For a partially synced block, keep a list of
> ranges of synced stripes (a list will normally be smaller than a bitmap
> here). Whenever there are partially synced blocks in the array, have a
> low priority process (like the normal array creation sync process, or
> rebuild processes) sync the stripes until the block is finished as a
> fully synced block.
>
> That should let you delay the time-consuming and write intensive
> creation sync until you actually need to sync the blocks, without /too/
> much overhead in metadata or in delays when using the disk.

I was thinking along those lines. You mentioned earlier what I would
think of as a "high water mark" - or "how far have we used the array".
The only snag I can think of there is if you start writing in the middle
of the array, because then everything below that point counts as used.
So your idea of blocks sounds a lot better.
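
To make the comparison concrete, here is a rough Python sketch of the
three-state sync blocks next to a high water mark. It is only an
illustration, not md code: the names SyncBlockMap, write,
background_sync_step, pending and high_water_mark_cost are all made up,
and a set of stripe numbers stands in for the list of synced ranges.

```python
from enum import Enum


class SyncState(Enum):
    UNUSED = 0    # never written, nothing to sync
    PARTIAL = 1   # some stripes written/synced, rest still pending
    SYNCED = 2    # every stripe in the block is in sync


class SyncBlockMap:
    """Hypothetical per-array map of sync blocks (illustration only)."""

    def __init__(self, total_stripes, stripes_per_block):
        self.total_stripes = total_stripes
        self.stripes_per_block = stripes_per_block
        nblocks = -(-total_stripes // stripes_per_block)  # ceiling division
        self.state = [SyncState.UNUSED] * nblocks
        # Assumption: a set per block instead of a list of ranges.
        self.synced = [set() for _ in range(nblocks)]

    def _block_len(self, block):
        start = block * self.stripes_per_block
        return min(self.stripes_per_block, self.total_stripes - start)

    def _mark(self, stripe):
        block = stripe // self.stripes_per_block
        if self.state[block] is SyncState.SYNCED:
            return
        self.synced[block].add(stripe)
        if len(self.synced[block]) == self._block_len(block):
            self.state[block] = SyncState.SYNCED
            self.synced[block].clear()
        else:
            self.state[block] = SyncState.PARTIAL

    def is_synced(self, stripe):
        block = stripe // self.stripes_per_block
        return (self.state[block] is SyncState.SYNCED
                or stripe in self.synced[block])

    def write(self, stripe):
        """Record a write; return True if it had to be a full stripe write."""
        needs_full_stripe = not self.is_synced(stripe)
        self._mark(stripe)
        return needs_full_stripe

    def background_sync_step(self):
        """Sync one pending stripe of a partial block; None if none left."""
        for block, st in enumerate(self.state):
            if st is not SyncState.PARTIAL:
                continue
            start = block * self.stripes_per_block
            for s in range(start, start + self._block_len(block)):
                if s not in self.synced[block]:
                    self._mark(s)
                    return s
        return None

    def pending(self):
        """Stripes the background sync still has to touch."""
        return sum(self._block_len(b) - len(self.synced[b])
                   for b, st in enumerate(self.state)
                   if st is SyncState.PARTIAL)


def high_water_mark_cost(written_stripes):
    """Stripes covered if we only track the highest stripe ever written."""
    return max(written_stripes) + 1 if written_stripes else 0
```

With 1000 stripes in blocks of 100, a first write to stripe 550 makes
only block 5 partially synced, leaving 99 more stripes for the
background sync, whereas a high water mark would have to cover stripes
0 to 550, i.e. 551 of them.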

The other thing - create a flag "--new-array" (this would probably be a
synonym of "--assume-clean"). This would have to be an opt-in - it
tells mdadm that whatever is on the disk is garbage, and when it does
sync it can safely just stream zeroes to the disk - no reads or parity
checks required ... :-) (This idea might need a few tweaks :-)

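
The reason streaming zeroes needs no reads or parity calculation is
that the parity of all-zero data is itself all zeroes. A small sketch,
assuming RAID5-style XOR parity (P) and the RAID6 Q syndrome over
GF(2^8) with generator 2 and polynomial 0x11d; the function names are
made up for illustration:

```python
from functools import reduce


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result


def p_parity(blocks):
    """XOR parity: byte-wise XOR across all data blocks."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


def q_parity(blocks):
    """Q syndrome: sum of 2**i * D_i in GF(2^8) over data blocks D_i."""
    out = bytearray(len(blocks[0]))
    coeff = 1
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(coeff, byte)
        coeff = gf_mul(coeff, 2)
    return bytes(out)


# Four data disks' worth of zeroed chunks (16 bytes each for brevity).
zero_chunks = [bytes(16) for _ in range(4)]
p_is_zero = p_parity(zero_chunks) == bytes(16)
q_is_zero = q_parity(zero_chunks) == bytes(16)
```

So if every member gets zeroes, data and parity chunks alike, each
stripe is consistent without anything being read back or computed.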
Cheers,
Wol