* RAID6 write I/O amplification?
@ 2015-02-23 23:58 Roman Mamedov
2015-02-24 6:29 ` AW: " Markus Stockhausen
0 siblings, 1 reply; 4+ messages in thread
From: Roman Mamedov @ 2015-02-23 23:58 UTC (permalink / raw)
To: linux-raid
Hello,
Got a bit of a "how does it actually work" question...
Suppose I have an MD RAID6 of 8 drives, with 64KB chunk size.
I am rewriting a 4KB filesystem sector somewhere on that RAID (not crossing
the stripe boundary).
What's the amount of disk I/O in total this will result in?
I assume the RAID will need to read data from all drives, recompute parity,
then write to the data stripe where the updated piece happened to be, and also
write to two parity stripes.
Is this done at a stripe granularity, so 6x64KB reads, 3x64KB writes?
Or down to individual sectors (pages), i.e. 6x4KB reads, 3x4KB writes?
Or am I describing this algorithm correctly at all?
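
To put rough numbers on the two hypotheses, here is a small Python sketch
(assuming only what is stated above: an 8-drive RAID6 gives 6 data + 2
parity chunks per stripe, and the update touches a single data chunk):

    CHUNK_KB = 64   # configured chunk size
    PAGE_KB = 4     # size of the rewritten filesystem sector
    DATA = 6        # 8 drives minus 2 parity
    PARITY = 2

    for unit_kb, label in [(CHUNK_KB, "full-chunk granularity"),
                           (PAGE_KB, "4K page granularity")]:
        read_kb = DATA * unit_kb            # read the data part of the stripe
        write_kb = (1 + PARITY) * unit_kb   # rewrite changed data + P + Q
        print(f"{label}: {DATA}x{unit_kb}KB read + {1 + PARITY}x{unit_kb}KB"
              f" write = {read_kb + write_kb}KB of disk I/O in total")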
--
With respect,
Roman
* AW: RAID6 write I/O amplification?
2015-02-23 23:58 RAID6 write I/O amplification? Roman Mamedov
@ 2015-02-24 6:29 ` Markus Stockhausen
2015-02-26 0:40 ` Alireza Haghdoost
0 siblings, 1 reply; 4+ messages in thread
From: Markus Stockhausen @ 2015-02-24 6:29 UTC (permalink / raw)
To: Roman Mamedov, linux-raid@vger.kernel.org
> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Roman Mamedov [rm@romanrm.net]
> Sent: Tuesday, 24 February 2015 00:58
> To: linux-raid@vger.kernel.org
> Subject: RAID6 write I/O amplification?
>
> Hello,
>
> Got a bit of a "how does it actually work" question...
>
> Suppose I have an MD RAID6 of 8 drives, with 64KB chunk size.
>
> I am rewriting a 4KB filesystem sector somewhere on that RAID (not crossing
> the stripe boundary).
>
> What's the amount of disk I/O in total this will result in?
>
> I assume the RAID will need to read data from all drives, recompute parity,
> then write to the data stripe where the updated piece happened to be, and also
> write to two parity stripes.
>
> Is this done at a stripe granularity, so 6x64KB reads, 3x64KB writes?
> Or down to individual sectors (pages), i.e. 6x4KB reads, 3x4KB writes?
> Or am I describing this algorithm correctly at all?
The implementation works at the "internal" stripe granularity, which is 4K.
So your case will be 6x4KB reads + 3x4KB writes. That said, you can only
reduce the I/O overhead by writing data that is larger than your configured
stripe size (e.g. 64K).

Looking at Neil's development git tree you will find patches that allow
read-modify-write cycles for RAID6. With those we only need to read the old
block and the old parities, recalculate the parities, and write the new
block and the new parities. In your case that would reduce the I/Os to
3x4KB reads + 3x4KB writes.

See http://git.neil.brown.name/?p=md.git;a=shortlog;h=refs/heads/devel

I posted them 6 months ago but they have not made their way into the stable
tree. Additionally, that branch contains patches to batch adjacent writes so
they are processed in fewer and larger I/Os; currently Linux RAID will break
each operation into 4K I/Os.
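
For reference, a minimal Python sketch of the per-page bookkeeping behind the
two strategies (this is not the actual raid5.c code path, just the operation
counts; the 8-drive layout and the 4K internal stripe size are as above):

    PAGE_KB = 4
    DATA_DISKS = 6   # 8 drives minus P and Q

    def reconstruct_write():
        # Read every data page of the internal stripe, recompute P and Q
        # from scratch, then write the changed page plus both parities.
        reads = DATA_DISKS       # 6 x 4KB
        writes = 1 + 2           # new data page + new P + new Q
        return reads, writes

    def read_modify_write():
        # Read only the old data page and the old P and Q, back the old
        # data out of the parities, fold the new data in, write all three.
        reads = 1 + 2            # old data page + old P + old Q
        writes = 1 + 2           # new data page + new P + new Q
        return reads, writes

    for name, fn in (("reconstruct-write (current)", reconstruct_write),
                     ("read-modify-write (patched)", read_modify_write)):
        r, w = fn()
        print(f"{name}: {r}x{PAGE_KB}KB read + {w}x{PAGE_KB}KB write")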
Markus
* Re: RAID6 write I/O amplification?
2015-02-24 6:29 ` AW: " Markus Stockhausen
@ 2015-02-26 0:40 ` Alireza Haghdoost
2015-02-26 0:55 ` NeilBrown
0 siblings, 1 reply; 4+ messages in thread
From: Alireza Haghdoost @ 2015-02-26 0:40 UTC (permalink / raw)
To: Markus Stockhausen; +Cc: Roman Mamedov, linux-raid@vger.kernel.org
On Tue, Feb 24, 2015 at 12:29 AM, Markus Stockhausen
<stockhausen@collogia.de> wrote:
>> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Roman Mamedov [rm@romanrm.net]
>> Sent: Tuesday, 24 February 2015 00:58
>> To: linux-raid@vger.kernel.org
>> Subject: RAID6 write I/O amplification?
>>
>> Hello,
>>
>> Got a bit of a "how does it actually work" question...
>>
>> Suppose I have an MD RAID6 of 8 drives, with 64KB chunk size.
>>
>> I am rewriting a 4KB filesystem sector somewhere on that RAID (not crossing
>> the stripe boundary).
>>
>> What's the amount of disk I/O in total this will result in?
>>
>> I assume the RAID will need to read data from all drives, recompute parity,
>> then write to the data stripe where the updated piece happened to be, and also
>> write to two parity stripes.
>>
>> Is this done at a stripe granularity, so 6x64KB reads, 3x64KB writes?
>> Or down to individual sectors (pages), i.e. 6x4KB reads, 3x4KB writes?
>> Or am I describing this algorithm correctly at all?
>
> The implementation works at the "internal" stripe granularity, which is 4K.
> So your case will be 6x4KB reads + 3x4KB writes.
Having said that, does it mean that the following description of "chunk
size" is wrong:

'[chunk size] is the smallest "atomic" mass of data that can be
written to the devices'

since in this case the chunk size is 64KB but 4KB is written atomically?
I found it on the kernel.org wiki page [1].
---
Alireza
1. https://raid.wiki.kernel.org/index.php/RAID_setup
* Re: RAID6 write I/O amplification?
2015-02-26 0:40 ` Alireza Haghdoost
@ 2015-02-26 0:55 ` NeilBrown
0 siblings, 0 replies; 4+ messages in thread
From: NeilBrown @ 2015-02-26 0:55 UTC (permalink / raw)
To: Alireza Haghdoost
Cc: Markus Stockhausen, Roman Mamedov, linux-raid@vger.kernel.org
On Wed, 25 Feb 2015 18:40:46 -0600 Alireza Haghdoost <alireza@cs.umn.edu>
wrote:
> On Tue, Feb 24, 2015 at 12:29 AM, Markus Stockhausen
> <stockhausen@collogia.de> wrote:
> >> From: linux-raid-owner@vger.kernel.org [linux-raid-owner@vger.kernel.org] on behalf of Roman Mamedov [rm@romanrm.net]
> >> Sent: Tuesday, 24 February 2015 00:58
> >> To: linux-raid@vger.kernel.org
> >> Subject: RAID6 write I/O amplification?
> >>
> >> Hello,
> >>
> >> Got a bit of a "how does it actually work" question...
> >>
> >> Suppose I have an MD RAID6 of 8 drives, with 64KB chunk size.
> >>
> >> I am rewriting a 4KB filesystem sector somewhere on that RAID (not crossing
> >> the stripe boundary).
> >>
> >> What's the amount of disk I/O in total this will result in?
> >>
> >> I assume the RAID will need to read data from all drives, recompute parity,
> >> then write to the data stripe where the updated piece happened to be, and also
> >> write to two parity stripes.
> >>
> >> Is this done at a stripe granularity, so 6x64KB reads, 3x64KB writes?
> >> Or down to individual sectors (pages), i.e. 6x4KB reads, 3x4KB writes?
> >> Or am I describing this algorithm correctly at all?
> >
> > The implementation works at the "internal" stripe granularity, which is 4K.
> > So your case will be 6x4KB reads + 3x4KB writes.
>
> Having said that, does it mean that the following description of "chunk
> size" is wrong:
>
> '[chunk size] is the smallest "atomic" mass of data that can be
> written to the devices'
>
> since in this case the chunk size is 64KB but 4KB is written atomically?
> I found it on the kernel.org wiki page [1].
I think that when it says "atomic" it means in space, not time.
i.e. one (properly aligned) chunk of data will not be split up and
written to different devices, it will all be written to one device.
If you write more than a chunk, it will be split up and parts of it written
to different devices.
You can still write less than a chunk.
So the intent is correct I think, but the word "atomic" doesn't really convey
the right meaning. Probably it should be re-written to avoid that term and
just spell out what is happening.
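
To make "atomic in space" concrete, a minimal sketch assuming a plain
round-robin mapping of chunks to data devices (md's real RAID6 layouts also
rotate parity across devices, which is ignored here):

    CHUNK_KB = 64
    DATA_DISKS = 6

    def devices_touched(offset_kb, length_kb):
        """Return the set of data-device indices a write at this
        offset/length would cover, under the simplified layout."""
        first_chunk = offset_kb // CHUNK_KB
        last_chunk = (offset_kb + length_kb - 1) // CHUNK_KB
        return {chunk % DATA_DISKS for chunk in range(first_chunk, last_chunk + 1)}

    print(devices_touched(offset_kb=128, length_kb=64))   # aligned chunk -> one device
    print(devices_touched(offset_kb=128, length_kb=4))    # less than a chunk -> still one device
    print(devices_touched(offset_kb=128, length_kb=256))  # more than a chunk -> split across devices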
NeilBrown
Thread overview:
2015-02-23 23:58 RAID6 write I/O amplification? Roman Mamedov
2015-02-24 6:29 ` AW: " Markus Stockhausen
2015-02-26 0:40 ` Alireza Haghdoost
2015-02-26 0:55 ` NeilBrown