* cgroups-blkio CFQ scheduling does not work well in a RAID5 configuration.
@ 2013-11-29 14:06 Martin Boutin
From: Martin Boutin @ 2013-11-29 14:06 UTC (permalink / raw)
To: Jens Axboe, Kernel.org-Linux-RAID, Kernel.org-Linux-cgroups
Hello list,
Today I was trying to figure out how to get block I/O prioritization
working for a certain process. The process is a streaming server that
reads, using O_DIRECT, a big file stored on an XFS filesystem on top
of a 3-disk RAID5 array.
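(The daemon's access pattern is roughly equivalent to the following,
where /data/bigfile stands in for the actual file being streamed:)
# large sequential O_DIRECT reads, like the streaming server does
$ dd if=/data/bigfile of=/dev/null bs=4M iflag=direct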
I'm setting up cgroups this way:
$ echo 1000 > /sys/fs/cgroup/blkio/prio/blkio.weight
$ echo 10 > /sys/fs/cgroup/blkio/blkio.leaf_weight
the intent being that tasks in the prio cgroup get roughly a 100:1
share of disk time compared to all other tasks, which remain in the
root cgroup and are weighted down by its low leaf weight.
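For reference, the prio cgroup itself was created beforehand and the
daemon attached to it, roughly like this ($STREAMER_PID stands in for
the daemon's actual PID):
# create the cgroup under the already-mounted blkio hierarchy
$ mkdir /sys/fs/cgroup/blkio/prio
# move the streaming daemon into it
$ echo $STREAMER_PID > /sys/fs/cgroup/blkio/prio/tasks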
Taking the RAID5 setup out of the picture, I created an XFS
filesystem directly on /dev/sdb2, mounted it on /data, put my
streaming daemon in the prio cgroup, and had the daemon stream around
250MiB/s of data while fio ran disk-I/O-intensive jobs. Over a period
of 5 minutes, the streaming daemon had to stop and rebuffer about 5
times.
Now, with the same scenario but using the RAID5 device, and with the
daemon streaming 500MiB/s of data (because the RAID has around twice
the throughput of a single drive), over a period of 5 minutes the
streaming daemon had to stop and rebuffer about 50 times! That is 10
times more often than in the single-drive case.
While streaming, I observed both blkio.sectors and blkio.io_queued
for both cgroups (the root cgroup and prio). When only the streaming
daemon runs (i.e., fio is stopped), the sector count in
prio/blkio.sectors increases while (root)/blkio.sectors does not,
confirming that the daemon's I/O is correctly accounted to the prio
cgroup.
Then, while both the streaming daemon and fio run, io_queued shows
that the root cgroup has about 50 queued requests in total on
average, while the prio cgroup has only an occasional delayed
request.
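I watched the counters with something along these lines (paths as
set up above):
# refresh the queued-request counts for both cgroups every second
$ watch -n1 'cat /sys/fs/cgroup/blkio/blkio.io_queued /sys/fs/cgroup/blkio/prio/blkio.io_queued'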
$ uname -a
Linux haswell1 3.10.10 #9 SMP PREEMPT Fri Nov 29 11:38:20 CET 2013
i686 GNU/Linux
Any ideas?
Thanks,
--
Martin Boutin
* Re: cgroups-blkio CFQ scheduling does not work well in a RAID5 configuration.
@ 2013-11-29 14:15 Martin Boutin
From: Martin Boutin @ 2013-11-29 14:15 UTC (permalink / raw)
To: Jens Axboe, Kernel.org-Linux-RAID, Kernel.org-Linux-cgroups
I forgot to mention that this might have to do with the md0_raid5
process. That process does the RAID parity work on behalf of both
workloads (the streaming daemon and fio). By default it stays in the
root cgroup, which means that RAID-related I/O is deprioritized even
for processes in the prio cgroup; this might be introducing the
delays in the I/O.
On the other hand, I cannot put the md0_raid5 process in the prio
cgroup either, because that would let RAID-related I/O issued on
behalf of all other processes steal disk time from the priority
processes.
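To see where the md thread is charged, and what moving it would look
like (assuming the kernel even allows attaching this kernel thread):
# which blkio cgroup is md0_raid5 accounted to right now?
$ cat /proc/$(pgrep md0_raid5)/cgroup
# moving it would be the following, but as argued above that would
# also prioritize parity I/O done on behalf of fio:
$ echo $(pgrep md0_raid5) > /sys/fs/cgroup/blkio/prio/tasks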
On Fri, Nov 29, 2013 at 9:06 AM, Martin Boutin <martboutin@gmail.com> wrote:
> [original message quoted in full; trimmed]
--
Martin Boutin
* Re: cgroups-blkio CFQ scheduling does not work well in a RAID5 configuration.
@ 2013-12-09 9:05 Martin Boutin
From: Martin Boutin @ 2013-12-09 9:05 UTC (permalink / raw)
To: CoolCold; +Cc: Jens Axboe, Kernel.org-Linux-RAID, Kernel.org-Linux-cgroups
Any thoughts here?
- Martin
On Sun, Dec 1, 2013 at 11:44 AM, CoolCold <coolthecold@gmail.com> wrote:
> I hope Neil will shed some light here, interesting question.
>
> [Martin's earlier messages quoted in full; trimmed]
>
> --
> Best regards,
> [COOLCOLD-RIPN]
--
Martin Boutin
* Re: cgroups-blkio CFQ scheduling does not work well in a RAID5 configuration.
@ 2013-12-09 20:50 Stan Hoeppner
From: Stan Hoeppner @ 2013-12-09 20:50 UTC (permalink / raw)
To: Martin Boutin, CoolCold
Cc: Jens Axboe, Kernel.org-Linux-RAID, Kernel.org-Linux-cgroups
On 12/9/2013 3:05 AM, Martin Boutin wrote:
> Any thoughts here?
Your testing methodology is neither scientific nor thorough, and your
information is incomplete. This may be why you're receiving no replies...
You suggest the problem is related to md because taking it out of the
loop shows "less breakage" of your streaming application. However,
you're using XFS. Thus this is applicable:
"As of kernel 3.2.12, the default i/o scheduler, CFQ, will defeat much
of the parallelization in XFS. "
http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E
It may simply be that the CFQ/XFS problem is manifesting itself more
prominently with md than single disk in your case.
Also, are you aligning XFS to the md geometry? mkfs.xfs should have
aligned to md automatically but sometimes this may break.
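For example, something along these lines; the su/sw values are only
an illustration assuming a 512KiB chunk on a 3-disk RAID5 (2 data
spindles), so substitute your actual geometry:
# verify the alignment XFS picked up (sunit/swidth, in 512B units)
$ xfs_info /data
# or set the geometry explicitly at mkfs time
$ mkfs.xfs -d su=512k,sw=2 /dev/md0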
I suggest testing with the deadline elevator. I'd also suggest
benchmarking your disks individually and benchmarking the md0 RAID5
array and providing the results. It's possible your RAID5 array is
actually performing slower than a single disk, but you don't know it
because you've not tested it. What's your stripe_cache_size value? The
default may be too low for your disks/array. What is the configuration
of your RAID5 array? Chunk size?
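Roughly, for example (device names taken from your earlier mail;
adjust to your setup):
# switch a member disk to the deadline elevator
$ echo deadline > /sys/block/sdb/queue/scheduler
# check and, if it's still the default 256, raise the stripe cache
$ cat /sys/block/md0/md/stripe_cache_size
$ echo 4096 > /sys/block/md0/md/stripe_cache_size
# sequential read baseline: single disk vs. the array
$ dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct
$ dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct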
You said a single drive was streaming 250 MB/s. That's impossible
unless you're using SSDs. If what you really meant is that you told
your streaming program to read at 250 MB/s, then of course you'll get
rebuffering, as the disks can't keep up with that rate; a single SATA
drive manages only about half of it. You did not mention SSDs. You
did not mention rust. You did not mention drive make/model/size.
> [earlier messages quoted in full; trimmed]