* [PATCH] md/raid5: don't do chunk aligned read on degraded array.

From: Eric Mei <eric.mei@seagate.com> @ 2015-03-19  5:39 UTC
To: NeilBrown; +Cc: linux-raid

When the array is degraded, read data that lands on a failed drive
will result in reading the rest of the data in the stripe. So a
single sequential read would result in the same data being read
twice.

This patch avoids chunk-aligned reads for degraded arrays. The
downside is involving the stripe cache, which means associated CPU
overhead and an extra memory copy.

Signed-off-by: Eric Mei <eric.mei@seagate.com>
---
 drivers/md/raid5.c | 15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cd2f96b..763c64a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4180,8 +4180,12 @@ static int raid5_mergeable_bvec(struct mddev *mddev,
 	unsigned int chunk_sectors = mddev->chunk_sectors;
 	unsigned int bio_sectors = bvm->bi_size >> 9;
 
-	if ((bvm->bi_rw & 1) == WRITE)
-		return biovec->bv_len; /* always allow writes to be mergeable */
+	/*
+	 * always allow writes to be mergeable, read as well if array
+	 * is degraded as we'll go through stripe cache anyway.
+	 */
+	if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
+		return biovec->bv_len;
 
 	if (mddev->new_chunk_sectors < mddev->chunk_sectors)
 		chunk_sectors = mddev->new_chunk_sectors;
@@ -4656,7 +4660,12 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 
 	md_write_start(mddev, bi);
 
-	if (rw == READ &&
+	/*
+	 * If array is degraded, better not do chunk aligned read because
+	 * later we might have to read it again in order to reconstruct
+	 * data on failed drives.
+	 */
+	if (rw == READ && mddev->degraded == 0 &&
 	    mddev->reshape_position == MaxSector &&
 	    chunk_aligned_read(mddev,bi))
 		return;
-- 
1.7.1
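As a rough illustration of the double read described above, the effect
can be observed on a scratch array along these lines (a sketch; the
device names and array layout are assumptions, not from this thread):

    # Degrade a healthy RAID6 array (hypothetical member /dev/sdk).
    mdadm /dev/md0 --fail /dev/sdk

    # Issue large sequential direct reads spanning many chunks.
    dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct

    # Watch per-disk throughput: chunk-aligned reads that hit the failed
    # member trigger stripe reconstruction, so surviving members serve
    # some of the same data twice during a sequential scan.
    iostat -xm 1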
* Re: [PATCH] md/raid5: don't do chunk aligned read on degraded array.

From: NeilBrown @ 2015-03-19  6:02 UTC
To: Eric Mei; +Cc: linux-raid

On Wed, 18 Mar 2015 23:39:11 -0600 Eric Mei <meijia@gmail.com> wrote:

> From: Eric Mei <eric.mei@seagate.com>
>
> When the array is degraded, read data that lands on a failed drive
> will result in reading the rest of the data in the stripe. So a
> single sequential read would result in the same data being read
> twice.
>
> This patch avoids chunk-aligned reads for degraded arrays. The
> downside is involving the stripe cache, which means associated CPU
> overhead and an extra memory copy.
>
> Signed-off-by: Eric Mei <eric.mei@seagate.com>
> ---
>  drivers/md/raid5.c | 15 ++++++++++++---
>  1 files changed, 12 insertions(+), 3 deletions(-)
>
> [...]

Thanks for the patch.

However this sort of patch really needs to come with some concrete
performance numbers, preferably for both sequential reads and random
reads.

I agree that sequential reads are likely to be faster, but how much
faster are they? I imagine that this might make random reads a little
slower. Does it? By how much?

Thanks,
NeilBrown
* Re: [PATCH] md/raid5: don't do chunk aligned read on degraded array.

From: Eric Mei @ 2015-03-19 19:41 UTC
To: NeilBrown; +Cc: linux-raid

On 2015-03-19 12:02 AM, NeilBrown wrote:
> On Wed, 18 Mar 2015 23:39:11 -0600 Eric Mei <meijia@gmail.com> wrote:
>
>> [...]
>
> Thanks for the patch.
>
> However this sort of patch really needs to come with some concrete
> performance numbers, preferably for both sequential reads and random
> reads.
>
> I agree that sequential reads are likely to be faster, but how much
> faster are they? I imagine that this might make random reads a
> little slower. Does it? By how much?
>
> Thanks,
> NeilBrown

Hi Neil,

Sorry, I should have done these tests in the first place.

The following tests were done on an enterprise storage node with
Seagate 6T SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9 GHz),
using a 10-disk MD RAID6 (8+2) with a 128 KiB chunk size.

I used FIO with direct-io, various bs sizes, and enough queue depth,
testing sequential and 100% random reads against 3 array
configurations: 1) optimal, as baseline; 2) degraded; 3) degraded
with this patch. The kernel version is 4.0-rc3.

Each individual test was run only once, so there may be some
variation, but the focus here is on the big trend.
Sequential Read:

bs=(KiB)   optimal(MiB/s)   degraded(MiB/s)   degraded-with-patch(MiB/s)
1024       1608             656               995
512        1624             710               956
256        1635             728               980
128        1636             771               983
64         1612             1119              1000
32         1580             1420              1004
16         1368             688               986
8          768              647               953
4          411              413               850

Random Read:

bs=(KiB)   optimal(IOPS)   degraded(IOPS)   degraded-with-patch(IOPS)
1024       163             160              156
512        274             273              272
256        426             428              424
128        576             592              591
64         726             724              726
32         849             848              837
16         900             970              971
8          927             940              929
4          948             940              955

Some notes:
* In sequential + optimal, as the bs size gets smaller, the FIO thread
  becomes CPU bound.
* In sequential + degraded, there is a big increase when bs is 64K and
  32K; I don't have an explanation for it.
* In sequential + degraded-with-patch, the MD thread mostly becomes
  CPU bound.

We can discuss specific data points in those results if you want. But
in general, with this patch we get more predictable and in most cases
significantly better sequential read performance when the array is
degraded, and almost no noticeable impact on random reads.

Performance is a complicated thing; the patch works well for this
particular configuration but may not be universal. For example, I
imagine testing on an all-SSD array might give very different results.
But I personally think that in most cases IO bandwidth is a scarcer
resource than CPU.

Eric
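For reference, a run like the one above could be approximated with
mdadm and fio roughly as follows (a sketch; device names, queue depth,
and runtime are assumptions, since the exact job options were not
posted):

    # Build the array layout described above (hypothetical device names).
    mdadm --create /dev/md0 --level=6 --raid-devices=10 --chunk=128 \
          /dev/sd[b-k]

    # Sweep block sizes for the sequential case; use --rw=randread for
    # the 100% random case. --direct=1 matches the direct-io setting.
    for bs in 4k 8k 16k 32k 64k 128k 256k 512k 1024k; do
        fio --name=md-read --filename=/dev/md0 --ioengine=libaio \
            --direct=1 --rw=read --bs=$bs --iodepth=32 \
            --runtime=60 --time_based --group_reporting
    done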
* Re: [PATCH] md/raid5: don't do chunk aligned read on degraded array.

From: NeilBrown @ 2015-04-20  6:20 UTC
To: Eric Mei; +Cc: linux-raid

On Thu, 19 Mar 2015 13:41:16 -0600 Eric Mei <meijia@gmail.com> wrote:

> [...]
>
> We can discuss specific data points in those results if you want. But
> in general, with this patch we get more predictable and in most cases
> significantly better sequential read performance when the array is
> degraded, and almost no noticeable impact on random reads.
>
> Performance is a complicated thing; the patch works well for this
> particular configuration but may not be universal. For example, I
> imagine testing on an all-SSD array might give very different
> results. But I personally think that in most cases IO bandwidth is a
> scarcer resource than CPU.
>
> Eric

Thanks. That is reasonably convincing.

I've added that text to the commit message, fixed up all the
white-space damage in the patch (tabs were converted to spaces, etc.;
if you are going to be sending more patches, please find a way to
convince your mailer that spaces are important), and applied it.

It should be included in my pull request for 4.1.

Thanks,
NeilBrown