* Does XFS support cgroup writeback limiting?
@ 2015-11-23 11:05 Lutz Vieweg
2015-11-23 20:26 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Lutz Vieweg @ 2015-11-23 11:05 UTC (permalink / raw)
To: xfs
Hi,
in June 2015 the article https://lwn.net/Articles/648292/ mentioned
upcoming support for limiting the quantity of buffered writes
using control groups.
Back then, only ext4 was said to support that feature, with other
filesystems requiring some minor changes to do the same.
The generic cgroup writeback support made it into mainline
linux-4.2. I tried to find information on whether other filesystems
had meanwhile been adapted, but couldn't find this piece of information.
Therefore I'd like to ask: Does XFS (as of linux-4.3 or linux-4.4)
support limiting the quantity of buffered writes using control groups?
Regards,
Lutz Vieweg
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Does XFS support cgroup writeback limiting?
2015-11-23 11:05 Does XFS support cgroup writeback limiting? Lutz Vieweg
@ 2015-11-23 20:26 ` Dave Chinner
2015-11-23 22:08 ` Lutz Vieweg
From: Dave Chinner @ 2015-11-23 20:26 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: xfs
On Mon, Nov 23, 2015 at 12:05:53PM +0100, Lutz Vieweg wrote:
> Hi,
>
> in June 2015 the article https://lwn.net/Articles/648292/ mentioned
> upcoming support for limiting the quantity of buffered writes
> using control groups.
>
> Back then, only ext4 was said to support that feature, with other
> filesystems requiring some minor changes to do the same.
Yes, changing the kernel code to support this functionality is about
3 lines of code. However....
> The generic cgroup writeback support made it into mainline
> linux-4.2, I tried to find information on whether other filesystems
> had meanwhile been adapted, but couldn't find this piece of information.
>
> Therefore I'd like to ask: Does XFS (as of linux-4.3 or linux-4.4)
> support limiting the quantity of buffered writes using control groups?
.... I haven't added support to XFS because I have no way of
verifying the functionality works and that it continues to work as
it is intended, i.e. we have no regression test coverage for
cgroup-aware writeback, and until someone writes a set of regression
tests that validate its functionality works correctly it will remain
this way.
Writing code is trivial. Validating the code actually works as
intended and doesn't silently get broken in the future is the
hard part....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Does XFS support cgroup writeback limiting?
2015-11-23 20:26 ` Dave Chinner
@ 2015-11-23 22:08 ` Lutz Vieweg
2015-11-23 23:20 ` Dave Chinner
From: Lutz Vieweg @ 2015-11-23 22:08 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 11/23/2015 09:26 PM, Dave Chinner wrote:
> On Mon, Nov 23, 2015 at 12:05:53PM +0100, Lutz Vieweg wrote:
>> in June 2015 the article https://lwn.net/Articles/648292/ mentioned
>> upcoming support for limiting the quantity of buffered writes
>> using control groups.
>>
>> Back then, only ext4 was said to support that feature, with other
>> filesystems requiring some minor changes to do the same.
>
> Yes, changing the kernel code to support this functionality is about
> 3 lines of code.
Oh, I didn't expect it to be such a small change :-)
> .... I haven't added support to XFS because I have no way of
> verifying the functionality works and that it continues to work as
> it is intended. i.e. we have no regression test coverage for cgroup
> aware writeback and until someone writes a set of regression tests
> that validate its functionality works correctly it will remain this
> way.
>
> Writing code is trivial. Validating the code actually works as
> intended and doesn't silently get broken in the future is the
> hard part....
Understood. Would you nevertheless be willing to publish such a
three-line patch (outside of official releases) for those
daredevils (like me :-)) who'd be willing to give it a try?
After all, this functionality is the last piece of the
"isolation"-puzzle that is missing from Linux to actually
allow fencing off virtual machines or containers from DOSing
each other by using up all I/O bandwidth...
Regards,
Lutz Vieweg
* Re: Does XFS support cgroup writeback limiting?
2015-11-23 22:08 ` Lutz Vieweg
@ 2015-11-23 23:20 ` Dave Chinner
2015-11-25 18:28 ` Lutz Vieweg
From: Dave Chinner @ 2015-11-23 23:20 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: xfs
On Mon, Nov 23, 2015 at 11:08:42PM +0100, Lutz Vieweg wrote:
> On 11/23/2015 09:26 PM, Dave Chinner wrote:
> >On Mon, Nov 23, 2015 at 12:05:53PM +0100, Lutz Vieweg wrote:
> >>in June 2015 the article https://lwn.net/Articles/648292/ mentioned
> >>upcoming support for limiting the quantity of buffered writes
> >>using control groups.
> >>
> >>Back then, only ext4 was said to support that feature, with other
> >>filesystems requiring some minor changes to do the same.
> >
> >Yes, changing the kernel code to support this functionality is about
> >3 lines of code.
>
> Oh, I didn't expect it to be such a small change :-)
>
> >.... I haven't added support to XFS because I have no way of
> >verifying the functionality works and that it continues to work as
> >it is intended. i.e. we have no regression test coverage for cgroup
> >aware writeback and until someone writes a set of regression tests
> >that validate its functionality works correctly it will remain this
> >way.
> >
> >Writing code is trivial. Validating the code actually works as
> >intended and doesn't silently get broken in the future is the
> >hard part....
>
> Understood, would you anyway be willing to publish such a
> three-line-patch (outside of official releases) for those
> daredevils (like me :-)) who'd be willing to give it a try?
Just make the same mods to XFS as the ext4 patch here:
http://www.spinics.net/lists/kernel/msg2014816.html
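(For context: the ext4 patch linked above essentially just opts the superblock in to cgroup-aware writeback by setting the SB_I_CGROUPWB flag at mount time. An untested sketch of the analogous XFS change, with the surrounding function body elided, might look like this — a sketch only, not a reviewed patch:)

```c
/*
 * Untested sketch mirroring the ext4 patch above: cgroup-aware
 * writeback is opted into per superblock via the SB_I_CGROUPWB flag
 * (added in linux-4.2), set when the filesystem is mounted.
 * All surrounding context is elided.
 */
STATIC int
xfs_fs_fill_super(
	struct super_block	*sb,
	void			*data,
	int			silent)
{
	/* ... existing superblock setup ... */

	sb->s_iflags |= SB_I_CGROUPWB;	/* enable cgroup writeback accounting */

	/* ... */
}
```

Whether that one-liner actually behaves correctly is, of course, exactly what the regression tests discussed below would need to demonstrate.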
> After all, this functionality is the last piece of the
> "isolation"-puzzle that is missing from Linux to actually
> allow fencing off virtual machines or containers from DOSing
> each other by using up all I/O bandwidth...
Yes, I know, but no-one seems to care enough about it to provide
regression tests for it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Does XFS support cgroup writeback limiting?
2015-11-23 23:20 ` Dave Chinner
@ 2015-11-25 18:28 ` Lutz Vieweg
2015-11-25 21:35 ` Dave Chinner
From: Lutz Vieweg @ 2015-11-25 18:28 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 11/24/2015 12:20 AM, Dave Chinner wrote:
> Just make the same mods to XFS as the ext4 patch here:
>
> http://www.spinics.net/lists/kernel/msg2014816.html
I read at http://www.spinics.net/lists/kernel/msg2014819.html
about this patch:
> Journal data which is written by jbd2 worker is left alone by
> this patch and will always be written out from the root cgroup.
If the same was done for XFS, wouldn't this mean a malicious
process could still stall other processes' attempts to write
to the filesystem by performing arbitrary amounts of meta-data
modifications in a tight loop?
>> After all, this functionality is the last piece of the
>> "isolation"-puzzle that is missing from Linux to actually
>> allow fencing off virtual machines or containers from DOSing
>> each other by using up all I/O bandwidth...
>
> Yes, I know, but no-one seems to care enough about it to provide
> regression tests for it.
Well, I could give it a try, if a shell script tinkering with
control group parameters (which requires root privileges and
could easily stall the machine) would be considered adequate for
the purpose.
I would propose a test to be performed like this:
0) Identify a block device to test on. I guess some artificially
speed-limited DM device would be best?
Set the speed limit to X/100 MB per second, with X configurable.
1) Start 4 "good" plus 4 "evil" subprocesses competing for
write-bandwidth on the block device.
Assign the 4 "good" processes to two different control groups ("g1", "g2"),
assign the 4 "evil" processes to further two different control
groups ("e1", "e2"), so 4 control groups in total, with 2 tasks each.
2) Create 3 different XFS filesystem instances on the block
device, one for access by only the "good" processes,
one for access by only the "evil" processes, one for
shared access by at least two "good" and two "evil"
processes.
3) Behaviour of the processes:
"Good" processes will attempt to write a configured amount
of data (X MB) at 20% of the speed limit of the block device, modifying
meta-data at a moderate rate (like creating/renaming/deleting files
every few megabytes written).
Half of the "good" processes write to their "good-only" filesystem,
the other half writes to the "shared access" filesystem.
Half of the "evil" processes will attempt to write as much data
as possible into open files in a tight endless loop.
The other half of the "evil" processes will permanently
modify meta-data as quickly as possible, creating/renaming/deleting
lots of files, also in a tight endless loop.
Half of the "evil" processes writes to the "evil-only" filesystem,
the other half writes to the "shared access" filesystem.
4) Test 1: Configure all 4 control groups to allow for the same
buffered write rate percentage.
The test is successful if all "good processes" terminate successfully
after a time not longer than it would take to write 10 times X MB to the
rate-limited block device.
All processes to be killed after termination of all good processes or
some timeout. If the timeout is reached, the test is failed.
5) Test 2: Configure "e1" and "e2" to allow for "zero" buffered write rate.
The test is successful if the "good processes" terminate successfully
after a time not longer than it would take to write 5 times X MB to the
rate-limited block device.
All processes to be killed after termination of all good processes or
some timeout. If the timeout is reached, the test is failed.
6) Cleanup: unmount test filesystems, remove rate-limited DM device, remove
control groups.
What do you think, could this be a reasonable plan?
Regards,
Lutz Vieweg
* Re: Does XFS support cgroup writeback limiting?
2015-11-25 18:28 ` Lutz Vieweg
@ 2015-11-25 21:35 ` Dave Chinner
2015-11-29 21:41 ` Lutz Vieweg
2015-12-01 11:01 ` I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?) L.A. Walsh
From: Dave Chinner @ 2015-11-25 21:35 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: xfs
On Wed, Nov 25, 2015 at 07:28:42PM +0100, Lutz Vieweg wrote:
> On 11/24/2015 12:20 AM, Dave Chinner wrote:
> >Just make the same mods to XFS as the ext4 patch here:
> >
> >http://www.spinics.net/lists/kernel/msg2014816.html
>
> I read at http://www.spinics.net/lists/kernel/msg2014819.html
> about this patch:
>
> >Journal data which is written by jbd2 worker is left alone by
> >this patch and will always be written out from the root cgroup.
>
> If the same was done for XFS, wouldn't this mean a malicious
> process could still stall other processes' attempts to write
> to the filesystem by performing arbitrary amounts of meta-data
> modifications in a tight loop?
XFS doesn't write file data through the journal, so no.
> >>After all, this functionality is the last piece of the
> >>"isolation"-puzzle that is missing from Linux to actually
> >>allow fencing off virtual machines or containers from DOSing
> >>each other by using up all I/O bandwidth...
> >
> >Yes, I know, but no-one seems to care enough about it to provide
> >regression tests for it.
>
> Well, I could give it a try, if a shell script tinkering with
> control groups parameters (which requires root privileges and
> could easily stall the machine) would be considered adequate for
> the purpose.
xfstests is where such tests need to live. It would need
infrastructure to set up control groups and bandwidth limits...
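A rough sketch of what such setup/teardown infrastructure might do, using the cgroup-v1 blkio interface documented in blkio-controller.txt (requires root; the mount point, device numbers, limit and paths here are illustrative placeholders, not a real xfstests interface):

```shell
#!/bin/sh
# Illustrative sketch of cgroup setup/teardown for a throttled writer.
# Requires root and a mounted blkio controller; all names and numbers
# are placeholders.
set -e

CGROOT=/sys/fs/cgroup/blkio    # typical cgroup-v1 mount point
GRP="$CGROOT/test-bound"
DEV="8:16"                     # major:minor of the test device, see /proc/partitions
LIMIT=$((1024 * 1024))         # 1 MB/s write cap

mkdir -p "$GRP"
echo "$DEV $LIMIT" > "$GRP/blkio.throttle.write_bps_device"

# Run one writer confined to the group: move a subshell into the
# group via cgroup.procs, then exec the actual workload.
(
	echo $$ > "$GRP/cgroup.procs"
	exec dd if=/dev/zero of=/mnt/scratch/bound-file bs=1M count=64 conv=fsync
)

# Teardown: a cgroup directory can be removed once it holds no tasks.
rmdir "$GRP"
```

Whether this cap applies to buffered writeback, and not just direct/sync IO, is precisely the filesystem-support question this thread is about, which is why the behaviour needs testing rather than assuming.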
> I would propose a test to be performed like this:
>
> 0) Identify a block device to test on. I guess some artificially
> speed-limited DM device would be best?
> Set the speed limit to X/100 MB per second, with X configurable.
xfstests provides a scratch device that can be used for this.
>
> 1) Start 4 "good" plus 4 "evil" subprocesses competing for
> write-bandwidth on the block device.
> Assign the 4 "good" processes to two different control groups ("g1", "g2"),
> assign the 4 "evil" processes to further two different control
> groups ("e1", "e2"), so 4 control groups in total, with 2 tasks each.
>
> 2) Create 3 different XFS filesystem instances on the block
> device, one for access by only the "good" processes,
> one for access by only the "evil" processes, one for
> shared access by at least two "good" and two "evil"
> processes.
Why do you need multiple filesystems? The writeback throttling is
designed to work within a single filesystem...
I was thinking of something similar, but quite simple, using "bound"
and "unbound" (i.e. limited and unlimited) processes, e.g.:
process 1 is unbound, does large sequential IO
processes 2-N are bound to 1MB/s, do large sequential IO
Run for several minutes to reach a stable steady state behaviour.
If processes 2-N do not receive 1MB/s throughput each, then throttling
of the unbound writeback process is not working. Combinations
of this test using different read/write streams on each process
give multiple tests and verify that block IO control works for both
read and write IO, not just writeback throttling.
And then other combinations of this sort of test, such as also
binding process 1 to, say, 20MB/s. Repeating the tests can
then tell us if fast and slow bindings are working correctly, i.e.
checking to ensure that process 1 doesn't exceed its limits and all
the other streams stay within bounds, too.
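The bound/unbound scheme above might be sketched roughly as follows (illustrative only — the scratch path, rates, runtime and tolerances are placeholders; the bound writers are assumed to have already been confined to their 1MB/s cgroups by setup code not shown, and a real xfstests version would use its own helpers):

```shell
#!/bin/sh
# Sketch of the bound/unbound test idea: one unthrottled sequential
# writer plus N writers assumed to be in cgroups limited to 1 MB/s
# each. Run to steady state, then verify each bound stream achieved
# roughly its configured rate. All values are illustrative.
set -e

SCRATCH=/mnt/scratch   # xfstests would supply the scratch mount
RUNTIME=120            # >30s so low-bandwidth streams see real writeback
RATE=$((1024 * 1024))  # 1 MB/s per bound stream
N=4

for i in $(seq 1 "$N"); do
	# NB: each of these writers must already be confined to its
	# 1 MB/s cgroup for the check below to mean anything.
	dd if=/dev/zero of="$SCRATCH/bound.$i" bs=1M count=100000 2>/dev/null &
done
dd if=/dev/zero of="$SCRATCH/unbound" bs=1M count=100000 2>/dev/null &

sleep "$RUNTIME"
kill $(jobs -p) 2>/dev/null || true
wait 2>/dev/null || true
sync

expected=$((RATE * RUNTIME))
for i in $(seq 1 "$N"); do
	got=$(stat -c %s "$SCRATCH/bound.$i")
	# 25% tolerance either way; real thresholds would need tuning
	if [ "$got" -lt $((expected * 3 / 4)) ] || \
	   [ "$got" -gt $((expected * 5 / 4)) ]; then
		echo "bound stream $i out of range: $got bytes in ${RUNTIME}s"
	fi
done
```

Measuring achieved throughput after a fixed runtime, rather than timing a fixed amount of data, matches the "run to steady state, then compare against the expected rate" approach described above.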
> 3) Behaviour of the processes:
>
> "Good" processes will attempt to write a configured amount
> of data (X MB) at 20% of the speed limit of the block device, modifying
> meta-data at a moderate rate (like creating/renaming/deleting files
> every few megabytes written).
> Half of the "good" processes write to their "good-only" filesystem,
> the other half writes to the "shared access" filesystem.
>
> Half of the "evil" processes will attempt to write as much data
> as possible into open files in a tight endless loop.
> The other half of the "evil" processes will permanently
> modify meta-data as quickly as possible, creating/renaming/deleting
> lots of files, also in a tight endless loop.
> Half of the "evil" processes writes to the "evil-only" filesystem,
> the other half writes to the "shared access" filesystem.
Metadata IO is not throttled - it is owned by the filesystem and hence
the root cgroup. There is no point in running tests that do large
amounts of journal/metadata IO, as this will result in
uncontrollable and unpredictable IO patterns and hence give
unreliable test results.
We want to test that the data bandwidth control algorithms work
appropriately in a controlled, repeatable environment. Throwing all
sorts of uncontrollable IO at the device is a good /stress/ test, but
it is not going to tell us anything useful in terms of correctness
or reliably detect functional regressions.
> 4) Test 1: Configure all 4 control groups to allow for the same
> buffered write rate percentage.
>
> The test is successful if all "good processes" terminate successfully
> after a time not longer than it would take to write 10 times X MB to the
> rate-limited block device.
If we are rate limiting to 1MB/s, then a 10s test is not long enough
to reach steady state. Indeed, it's going to take at least 30s worth
of IO to guarantee that we get writeback occurring for low
bandwidth streams....
i.e. the test needs to run for a period of time and then measure
the throughput of each stream, comparing it against the expected
throughput for the stream, rather than trying to write a fixed
bandwidth....
> 5) Test 2: Configure "e1" and "e2" to allow for "zero" buffered write rate.
>
> The test is successful if the "good processes" terminate successfully
> after a time not longer than it would take to write 5 times X MB to the
> rate-limited block device.
>
> All processes to be killed after termination of all good processes or
> some timeout. If the timeout is reached, the test is failed.
>
> 6) Cleanup: unmount test filesystems, remove rate-limited DM device, remove
> control groups.
control group cleanup will need to be added to the xfstests
infrastructure, but it handles everything else...
> What do you think, could this be a reasonable plan?
Yes, I think we can pull a reasonable set of baseline tests from an
approach like this.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Does XFS support cgroup writeback limiting?
2015-11-25 21:35 ` Dave Chinner
@ 2015-11-29 21:41 ` Lutz Vieweg
2015-11-30 23:44 ` Dave Chinner
2015-12-01 8:38 ` automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?) Martin Steigerwald
2015-12-01 11:01 ` I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?) L.A. Walsh
From: Lutz Vieweg @ 2015-11-29 21:41 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 11/25/2015 10:35 PM, Dave Chinner wrote:
>> 2) Create 3 different XFS filesystem instances on the block
>> device, one for access by only the "good" processes,
>> one for access by only the "evil" processes, one for
>> shared access by at least two "good" and two "evil"
>> processes.
>
> Why do you need multiple filesystems? The writeback throttling is
> designed to work within a single filesystem...
Hmm. Previously, I thought that the limiting of buffered writes
was realized by keeping track of the owners of dirty pages, and
that filesystem support was just required to make sure that writing
via a filesystem did not "anonymize" the dirty data. From what
I had read in blkio-controller.txt it seemed evident that limitations
would be accounted for "per block device", not "per filesystem", and
options like
> echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
document how to configure limits per block device.
Now after reading through the new Writeback section of blkio-controller.txt
again I am somewhat confused - the text states
> writeback operates on inode basis
and if that means inodes as in "file system inodes", this would
indeed mean limits would be enforced "per filesystem" - and yet
there are no options documented to specify limits for any specific
filesystem.
Does this mean some process writing to a block device (not via filesystem)
without "O_DIRECT" will dirty buffer pages, but those will not be limited
(as they are neither synchronous nor via-filesystem writes)?
That would mean VMs sharing some (physical or abstract) block device could
not really be isolated regarding their asynchronous write I/O...
> Metadata IO not throttled - it is owned by the filesystem and hence
> root cgroup.
Ouch. That kind of defeats the purpose of limiting evil processes'
ability to DOS other processes.
Wouldn't it be possible to assign some arbitrary cost to meta-data
operations - like "account one page write for each meta-data change
to the originating process of that change"? While certainly not
allowing for limiting to byte-precise limits of write bandwidth,
this would regain the ability to defend against DOS situations,
and for well-behaved processes, the "cost" accounted for their not-so-frequent
meta-data operations would probably not really hurt their writing
speed.
>> The test is successful if all "good processes" terminate successfully
>> after a time not longer than it would take to write 10 times X MB to the
>> rate-limited block device.
>
> if we are rate limiting to 1MB/s, then a 10s test is not long enough
> to reach steady state. Indeed, it's going to take at least 30s worth
> of IO to guarantee that we get writeback occurring for low
> bandwidth streams....
Sure, the "X/100 MB per second" throttle to the scratch device
was meant to result in a minimum test time of > 100s.
> i.e. the test needs to run for a period of time and then measure
> the throughput of each stream, comparing it against the expected
> throughput for the stream, rather than trying to write a fixed
> bandwidth....
The reason why I thought it a good idea to have the "good" processes
use only a limited write rate was to ensure that the actual write
activity of those processes is spread out over enough time that
they could, after all, feel some back-pressure from the operating
system, which is applied only after the "bad" processes have filled up
all RAM dedicated to the dirty buffer cache.
Assume the test instance has lots of memory and would be willing to
spend many Gigabytes of RAM for dirty buffer caches. Chances are that
in such a situation the "good" processes might be done writing their
limited amount of data almost instantaneously, because the data just
went to RAM.
(I understand that if one used the absolute "blkio.throttle.write*" options,
back-pressure could apply before the dirty buffer cache was maxed out,
but in real-world scenarios people will almost always use the relative
"blkio.weight"-based limiting; after all, you usually don't want to throttle
processes if there is plenty of bandwidth left that no other process wants
at the same time.)
Regards,
Lutz Vieweg
* Re: Does XFS support cgroup writeback limiting?
2015-11-29 21:41 ` Lutz Vieweg
@ 2015-11-30 23:44 ` Dave Chinner
2015-12-01 8:38 ` automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?) Martin Steigerwald
From: Dave Chinner @ 2015-11-30 23:44 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: xfs
On Sun, Nov 29, 2015 at 10:41:13PM +0100, Lutz Vieweg wrote:
> On 11/25/2015 10:35 PM, Dave Chinner wrote:
> >>2) Create 3 different XFS filesystem instances on the block
> >> device, one for access by only the "good" processes,
> >> one for access by only the "evil" processes, one for
> >> shared access by at least two "good" and two "evil"
> >> processes.
> >
> >Why do you need multiple filesystems? The writeback throttling is
> >designed to work within a single filesystem...
>
> Hmm. Previously, I thought that the limiting of buffered writes
> was realized by keeping track of the owners of dirty pages, and
> that filesystem support was just required to make sure that writing
> via a filesystem did not "anonymize" the dirty data. From what
> I had read in blkio-controller.txt it seemed evident that limitations
> would be accounted for "per block device", not "per filesystem", and
> options like
> >echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
> document how to configure limits per block device.
>
> Now after reading through the new Writeback section of blkio-controller.txt
> again I am somewhat confused - the text states
> >writeback operates on inode basis
> and if that means inodes as in "file system inodes", this would
> indeed mean limits would be enforced "per filesystem" - and yet
> there are no options documented to specify limits for any specific
> filesystem.
>
> Does this mean some process writing to a block device (not via filesystem)
> without "O_DIRECT" will dirty buffer pages, but those will not be limited
> (as they are neither synchronous nor via-filesystem writes)?
> That would mean VMs sharing some (physical or abstract) block device could
> not really be isolated regarding their asynchronous write I/O...
You are asking the wrong person - I don't know how this is all
supposed to work, how it's supposed to be configured, how different
cgroup controllers are supposed to interact, etc. Hence my request
for regression tests before we say "XFS supports ...." because
without them I have no idea if something is desired/correct
behaviour or not...
> >Metadata IO not throttled - it is owned by the filesystem and hence
> >root cgroup.
>
> Ouch. That kind of defeats the purpose of limiting evil processes'
> ability to DOS other processes.
> Wouldn't it be possible to assign some arbitrary cost to meta-data
> operations - like "account one page write for each meta-data change
No. Think of a file with millions of extents. Just reading a byte of
data will require pulling the entire extent map into memory, and so
doing hundreds of megabytes of IO and using that much memory.
> to the originating process of that change"? While certainly not
> allowing for limiting to byte-precise limits of write bandwidth,
> this would regain the ability to defend against DOS situations,
No, that won't help at all. In fact, it might even introduce new DOS
situations where we block a global metadata operation because it
doesn't have reservation space and so *everything* stops until that
metadata IO is dispatched and completed...
> Assume the test instance has lots of memory and would be willing to
> spend many Gigabytes of RAM for dirty buffer caches.
Most people will be running the tests on machines with limited
RAM and disk space, and so the tests really cannot depend on having
multiple gigabytes of RAM available for correct operation....
Indeed, limiting dirty memory thresholds will be part of setting up
a predictable, reliable test scenario (e.g. via
/proc/sys/vm/dirty_bytes and friends).
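For example, the dirty-memory ceiling could be pinned to small, fixed values for the duration of a test run (a config sketch with illustrative numbers; requires root):

```shell
#!/bin/sh
# Illustrative: pin the global dirty-memory ceilings to small, fixed
# values so test behaviour doesn't depend on total RAM. Requires root.
# Note dirty_bytes/dirty_ratio (and the background variants) are
# mutually exclusive: whichever is written last is in effect, and the
# other then reads back as 0, so restoring the old settings afterwards
# needs a little care.
echo $((64 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes             # 64 MB hard limit
echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes  # 16 MB background threshold
```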
> (I understand that if one used the absolute "blkio.throttle.write*" options
> pressure back could apply before the dirty buffer cache was maxed out,
> but in real-world scenarios people will almost always use the relative
> "blkio.weight" based limiting, after all, you usually don't want to throttle
> processes if there is plenty of bandwidth left no other process wants
> at the same time.)
Again, I have no idea how the throttling works or is configured, so
we need regression tests to cover both (all?) of these sorts of
common configuration scenarios.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?)
2015-11-29 21:41 ` Lutz Vieweg
2015-11-30 23:44 ` Dave Chinner
@ 2015-12-01 8:38 ` Martin Steigerwald
2015-12-01 16:38 ` Tejun Heo
From: Martin Steigerwald @ 2015-12-01 8:38 UTC (permalink / raw)
To: xfs, Tejun Heo; +Cc: linux-fsdevel, Lutz Vieweg
I think it makes sense to include those who wrote the cgroup writeback
limiting code in the discussion on how to automatically test this feature,
so I am adding fsdevel and Tejun to CC.
Am Sonntag, 29. November 2015, 22:41:13 CET schrieb Lutz Vieweg:
> On 11/25/2015 10:35 PM, Dave Chinner wrote:
> >> 2) Create 3 different XFS filesystem instances on the block
> >>
> >> device, one for access by only the "good" processes,
> >> one for access by only the "evil" processes, one for
> >> shared access by at least two "good" and two "evil"
> >> processes.
> >
> > Why do you need multiple filesystems? The writeback throttling is
> > designed to work within a single filesystem...
>
> Hmm. Previously, I thought that the limiting of buffered writes
> was realized by keeping track of the owners of dirty pages, and
> that filesystem support was just required to make sure that writing
> via a filesystem did not "anonymize" the dirty data. From what
> I had read in blkio-controller.txt it seemed evident that limitations
> would be accounted for "per block device", not "per filesystem", and
> options like
>
> > echo "<major>:<minor> <rate_bytes_per_second>" >
> > /cgrp/blkio.throttle.read_bps_device
> document how to configure limits per block device.
>
> Now after reading through the new Writeback section of blkio-controller.txt
> again I am somewhat confused - the text states
>
> > writeback operates on inode basis
>
> and if that means inodes as in "file system inodes", this would
> indeed mean limits would be enforced "per filesystem" - and yet
> there are no options documented to specify limits for any specific
> filesystem.
>
> Does this mean some process writing to a block device (not via filesystem)
> without "O_DIRECT" will dirty buffer pages, but those will not be limited
> (as they are neither synchronous nor via-filesystem writes)?
> That would mean VMs sharing some (physical or abstract) block device could
> not really be isolated regarding their asynchronous write I/O...
>
> > Metadata IO not throttled - it is owned by the filesystem and hence
> > root cgroup.
>
> Ouch. That kind of defeats the purpose of limiting evil processes'
> ability to DOS other processes.
> Wouldn't it be possible to assign some arbitrary cost to meta-data
> operations - like "account one page write for each meta-data change
> to the originating process of that change"? While certainly not
> allowing for limiting to byte-precise limits of write bandwidth,
> this would regain the ability to defend against DOS situations,
> and for well-behaved processes, the "cost" accounted for their
> not-so-frequent meta-data operations would probably not really hurt their
> writing
> speed.
>
> >> The test is successful if all "good processes" terminate successfully
> >>
> >> after a time not longer than it would take to write 10 times X MB to
> >> the
> >> rate-limited block device.
> >
> > if we are rate limiting to 1MB/s, then a 10s test is not long enough
> > to reach steady state. Indeed, it's going to take at least 30s worth
> > of IO to guarantee that we get writeback occurring for low
> > bandwidth streams....
>
> Sure, the "X/100 MB per second" throttle to the scratch device
> was meant to result in a minimal test time of > 100s.
>
> > i.e. the test needs to run for a period of time and then measure
> > the throughput of each stream, comparing it against the expected
> > throughput for the stream, rather than trying to write a fixed
> > bandwidth....
>
> The reason why I thought it to be a good idea to have the "good" processes
> use only a limited write rate was to make sure that the actual write
> activity of those processes is spread out over enough time to make sure
> that they could, after all, feel some "pressure back" from the operating
> system that is applied only after the "bad" processes have filled up
> all RAM dedicated to dirty buffer cache.
>
> Assume the test instance has lots of memory and would be willing to
> spend many Gigabytes of RAM for dirty buffer caches. Chances are that
> in such a situation the "good" processes might be done writing their
> limited amount of data almost instantaneously, because the data just
> went to RAM.
>
> (I understand that if one used the absolute "blkio.throttle.write*" options
> pressure back could apply before the dirty buffer cache was maxed out,
> but in real-world scenarios people will almost always use the relative
> "blkio.weight" based limiting; after all, you usually don't want to
> throttle processes if there is plenty of bandwidth left that no other
> process wants at the same time.)
>
>
> Regards,
>
> Lutz Vieweg
>
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 14+ messages in thread
* I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?)
2015-11-25 21:35 ` Dave Chinner
2015-11-29 21:41 ` Lutz Vieweg
@ 2015-12-01 11:01 ` L.A. Walsh
2015-12-01 20:18 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: L.A. Walsh @ 2015-12-01 11:01 UTC (permalink / raw)
To: Dave Chinner, xfs
Dave Chinner wrote:
> Metadata IO not throttled - it is owned by the filesystem and hence
> root cgroup.
----
Please forgive me if this is "obvious"....
If it is owned by the file system, then logically, such meta
data wouldn't count against the user's quota limits? That 'could'
be justified on the basis that the format of the metadata is outside
the control of the file owner and, depending on what's supported, included
the control of file owner and depending on what's supported, included
or logged in metadata, the size could vary greatly -- all outside of the
direct control of the user. However, by adding extended attributes,
the user can, likely, affect the disk space used by the metadata
regardless of how it is implemented and stored on disk -- so in that respect
it is user data. Similarly, system CPU time is attributed to the user
causing the CPU time to be spent, and so on.
To highlight the type of problem this can cause: starting after
Windows XP, MS decided to combine network file I/O requests over 64K
and dispatch them to the 'System' daemon, to optimize network I/O from
multiple processes by using bigger windows to keep the network pipe
as full as possible (competing, of course, with QOS for conforming
apps).
This created a problem -- applications that flood the network
sit mostly idle waiting for their RPC calls to finish, and all their
bandwidth is attributed to System.
FWIW -- you cannot change any of the priorities for System
without it going all blue screen on you, telling you that it's crashing
to protect you -- because something changed one of its priority numbers
(IO/cpu/memory...etc).
Problem is that any intensive prioritizing slows everything down
and the entire computer becomes unresponsive (10Gb link) as network
requests are reduced in size. Usually it is possible to restore
normality by restarting Explorer (assuming you have enough cpu time to
do so). About 10% of the time, pulling the network cord is required to
stop the storm, and VERY rarely, <.01% of the time power-cycling is
required.
Attributing user-generated I/O to system processes that are "unaccountable" *can* and does cause DoS "opportunities"....
Is this metadata really I/O that is completely disconnected from
a user that they cannot affect?
* Re: automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?)
2015-12-01 8:38 ` automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?) Martin Steigerwald
@ 2015-12-01 16:38 ` Tejun Heo
2015-12-03 0:18 ` automatic testing of cgroup writeback limiting Lutz Vieweg
0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-12-01 16:38 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-fsdevel, Lutz Vieweg, xfs
Hello,
On Tue, Dec 01, 2015 at 09:38:03AM +0100, Martin Steigerwald wrote:
> > > echo "<major>:<minor> <rate_bytes_per_second>" >
> > > /cgrp/blkio.throttle.read_bps_device
> > document how to configure limits per block device.
> >
> > Now after reading through the new Writeback section of blkio-controller.txt
> > again I am somewhat confused - the text states
> >
> > > writeback operates on inode basis
As opposed to pages. cgroup ownership is tracked per inode, not per
page, so if multiple cgroups write to the same inode at the same time,
some IOs will be incorrectly attributed.
> > and if that means inodes as in "file system inodes", this would
> > indeed mean limits would be enforced "per filesystem" - and yet
> > there are no options documented to specify limits for any specific
> > filesystem.
cgroup ownership is per-inode. IO throttling is per-device, so as
long as multiple filesystems map to the same device, they fall under
the same limit.
> > > Metadata IO not throttled - it is owned by the filesystem and hence
> > > root cgroup.
> >
> > Ouch. That kind of defeats the purpose of limiting evil processes'
> > ability to DOS other processes.
cgroup isn't a security mechanism and has to make active tradeoffs
between isolation and overhead. It doesn't provide protection against
malicious users and in general it's a pretty bad idea to depend on
cgroup for protection against hostile entities. Although some
controllers do better isolation than others, given how filesystems are
implemented, filesystem io control getting there will likely take a
while.
> > Wouldn't it be possible to assign some arbitrary cost to meta-data
> > operations - like "account one page write for each meta-data change
> > to the originating process of that change"? While certainly not
> > allowing for limiting to byte-precise limits of write bandwidth,
> > this would regain the ability to defend against DOS situations,
> > and for well-behaved processes, the "cost" accounted for their
> > not-so-frequent meta-data operations would probably not really
> > hurt their writing speed.
For aggregate consumers, this sort of approach does make sense -
measure total consumption by common operations and distribute the
charges afterwards; however, this will require quite a bit of work on
both io controller and filesystem sides.
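The "distribute the charges afterwards" idea can be illustrated with trivial arithmetic; the helper name and the page-granular, proportional costing are hypothetical and do not correspond to any existing kernel interface:

```shell
# charge_for META_PAGES CGROUP_OPS TOTAL_OPS: bill one cgroup its share
# of an aggregate metadata writeback cost, proportional to the number
# of metadata operations that cgroup issued
charge_for() {
    echo $(( $1 * $2 / $3 ))
}

# e.g. 100 pages of metadata writeback, one cgroup issued 3 of 4 ops:
#   charge_for 100 3 4   ->   75
```

The hard part is not the arithmetic but the bookkeeping: the filesystem would have to remember, per cgroup, who caused each batch of metadata changes before the aggregate writeback happens.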
> > (I understand that if one used the absolute "blkio.throttle.write*" options
> > pressure back could apply before the dirty buffer cache was maxed out,
> > but in real-world scenarios people will almost always use the relative
> > "blkio.weight" based limiting; after all, you usually don't want to
> > throttle processes if there is plenty of bandwidth left that no other
> > process wants at the same time.)
I'd recommend configuring both memory.high and io.weight so that the
buffer area isn't crazy high compared to io bandwidth. It should be
able to reach the configured ratio that way, and it also avoids two io
domains competing in the same io domain, which can skew the results.
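A minimal cgroup v2 sketch of that recommendation might look as follows; the /sys/fs/cgroup paths assume the unified hierarchy is mounted, and the group name and concrete values are purely illustrative:

```shell
# create a group and enable the memory and io controllers for it
mkdir /sys/fs/cgroup/app
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

# keep the dirty/page-cache buffer area in proportion to the io share
echo 512M > /sys/fs/cgroup/app/memory.high
echo 200  > /sys/fs/cgroup/app/io.weight

# move the workload into the group
echo "$$" > /sys/fs/cgroup/app/cgroup.procs
```

With memory.high bounding how much dirty data the group can accumulate, writeback pressure kicks in long before a single group has monopolized the page cache.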
Thanks.
--
tejun
* Re: I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?)
2015-12-01 11:01 ` I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?) L.A. Walsh
@ 2015-12-01 20:18 ` Dave Chinner
0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2015-12-01 20:18 UTC (permalink / raw)
To: L.A. Walsh; +Cc: xfs
On Tue, Dec 01, 2015 at 03:01:23AM -0800, L.A. Walsh wrote:
>
>
> Dave Chinner wrote:
> >Metadata IO not throttled - it is owned by the filesystem and hence
> >root cgroup.
> ----
> Please forgive me if this is "obvious"....
>
> If it is owned by the file system, then logically, such meta
> data wouldn't count against the user's quota limits?
Filesystem space usage accounting is not the same thing. It's just
accounting, and has nothing to do with how the metadata is dependent
on the transaction subsystem, caching, locking, parent/child
relationships with other metadata in the filesystem, etc.
> Is this metadata really I/O that is completely disconnected from
> a user that they cannot affect?
Throttling metadata IO - read or write - can lead to entire
filesystem stalls/deadlocks. e.g. transaction reservation cannot
complete because the inode that pins the tail of the log is blocked
from being written due to the process that needs to write it being
throttled by the IO controller. Hence every transaction in the
filesystem stalls until the IO controller allows the critical
metadata IO to be dispatched and completed...
Or, alternatively, a process is trying to allocate a block, and
holds an AG locked. It then tries to read a btree block, which is
throttled and blocked for some time. Any other allocation that needs
to take place in that AG is now blocked until the first allocation
is completed.
i.e. the moment we start throttling global metadata IO based on
per-process/cgroup limits, we end up with priority inversions and
partial/complete fs stalls all over the place. You can educate/ban
stupid users, but we can't easily prevent stalls due to
IO level priority inversions we have no control over at the
filesystem level...
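The AG-lock scenario can be mimicked in userspace with flock(1): two jobs share one lock, and the job that needs no IO at all still finishes last because the lock holder's IO is delayed (the sleep stands in for the IO controller's throttling; paths are throwaway temp files):

```shell
# two jobs contend for one "AG lock"; the first holds it across a
# throttled (slow) metadata read, so the second stalls even though it
# needs no IO of its own
lockfile=/tmp/ag.lock.$$
order=/tmp/ag.order.$$
: > "$order"

( flock 9
  sleep 0.5                 # throttled IO while holding the lock
  echo slow >> "$order"
) 9> "$lockfile" &
sleep 0.1                   # ensure the slow job grabs the lock first
( flock 9
  echo fast >> "$order"     # no IO needed, but blocked behind the throttle
) 9> "$lockfile" &
wait
cat "$order"                # prints: slow, then fast
```

In the real filesystem the "fast" waiter is every other allocation in that AG, which is what turns one throttled process into a global stall.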
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: automatic testing of cgroup writeback limiting
2015-12-01 16:38 ` Tejun Heo
@ 2015-12-03 0:18 ` Lutz Vieweg
2015-12-03 15:38 ` Tejun Heo
0 siblings, 1 reply; 14+ messages in thread
From: Lutz Vieweg @ 2015-12-03 0:18 UTC (permalink / raw)
To: Tejun Heo, Martin Steigerwald; +Cc: linux-fsdevel, xfs
On 12/01/2015 05:38 PM, Tejun Heo wrote:
> As opposed to pages. cgroup ownership is tracked per inode, not per
> page, so if multiple cgroups write to the same inode at the same time,
> some IOs will be incorrectly attributed.
I can't think of use cases where this could become a problem.
If more than one user/container/VM is allowed to write to the
same file at any one time, isolation is probably absent anyway ;-)
> cgroup ownership is per-inode. IO throttling is per-device, so as
> long as multiple filesystems map to the same device, they fall under
> the same limit.
Good, that's why I assumed it useful to include a scenario with more
than one filesystem on the same device into the test scenario, just
to know whether there are unexpected issues if more than one filesystem
utilizes the same underlying device.
>>>> Metadata IO not throttled - it is owned by the filesystem and hence
>>>> root cgroup.
>>>
>>> Ouch. That kind of defeats the purpose of limiting evil processes'
>>> ability to DOS other processes.
>
> cgroup isn't a security mechanism and has to make active tradeoffs
> between isolation and overhead. It doesn't provide protection against
> malicious users and in general it's a pretty bad idea to depend on
> cgroup for protection against hostile entities.
I wrote of "evil" processes for simplicity, but 99 out of 100 times
it's not intentional "evilness" that makes a process exhaust I/O
bandwidth of some device shared with other users/containers/VMs, it's
usually just bugs, inconsiderate programming or inappropriate use
that makes one process write like crazy, making other
users/containers/VMs suffer.
Wherever strict service level guarantees are relevant, and
applications require writing to storage, you currently cannot
consolidate two or more applications onto the same physical host,
even if they run under separate users/containers/VMs.
I understand there is no short or medium term solution that
would allow isolating processes writing to the same filesystem
(because of the metadata writing), but is it correct to say
that at least VMs, which do not allow the virtual guest to
cause extensive metadata writes on the physical host but only
write into pre-allocated image files, can be safely isolated
by the new "buffered write accounting"?
If so, we'd have to stay away from user or container based isolation
of independently SLA'd applications, but could at least resort to VMs
using image files on a shared filesystem.
Regards,
Lutz Vieweg
* Re: automatic testing of cgroup writeback limiting
2015-12-03 0:18 ` automatic testing of cgroup writeback limiting Lutz Vieweg
@ 2015-12-03 15:38 ` Tejun Heo
0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2015-12-03 15:38 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: linux-fsdevel, xfs
Hello, Lutz.
On Thu, Dec 03, 2015 at 01:18:48AM +0100, Lutz Vieweg wrote:
> On 12/01/2015 05:38 PM, Tejun Heo wrote:
> >As opposed to pages. cgroup ownership is tracked per inode, not per
> >page, so if multiple cgroups write to the same inode at the same time,
> >some IOs will be incorrectly attributed.
>
> I can't think of use cases where this could become a problem.
> If more than one user/container/VM is allowed to write to the
> same file at any one time, isolation is probably absent anyway ;-)
Yeap, that's why the trade-off was made.
> >cgroup ownership is per-inode. IO throttling is per-device, so as
> >long as multiple filesystems map to the same device, they fall under
> >the same limit.
>
> Good, that's why I assumed it useful to include a scenario with more
> than one filesystem on the same device into the test scenario, just
> to know whether there are unexpected issues if more than one filesystem
> utilizes the same underlying device.
Sure, I'd recommend including a multiple-writers-on-a-single-filesystem
case too, as that exposes entanglement in metadata handling. That
should expose problems in more places.
> I wrote of "evil" processes for simplicity, but 99 out of 100 times
> it's not intentional "evilness" that makes a process exhaust I/O
> bandwidth of some device shared with other users/containers/VMs, it's
> usually just bugs, inconsiderate programming or inappropriate use
> that makes one process write like crazy, making other
> users/containers/VMs suffer.
Right now, what cgroup writeback can control is well-behaved
workloads which aren't dominated by metadata writeback. We still have
ways to go, but it is a huge leap compared to what we had before.
> Whereever strict service level guarantees are relevant, and
> applications require writing to storage, you currently cannot
> consolidate two or more applications onto the same physical host,
> even if they run under separate users/containers/VMs.
You're right. It can't do isolation well enough for things like
strict service level guarantee.
> I understand there is no short or medium term solution that
> would allow to isolate processes writing to the same filesytem
> (because of the meta data writing), but is it correct to say
> that at least VMs, which do not allow the virtual guest to
> cause extensive meta data writes on the physical host, only
> writes into pre-allocated image files, can be safely isolated
> by the new "buffered write accounting"?
Sure, that or loop mounts. Pure data accesses should be fairly well
isolated.
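A sketch of that setup, with purely illustrative paths and sizes, giving each guest one big preallocated data file on the shared host filesystem:

```shell
# preallocate an image so the guest's writes are pure data IO on the
# host filesystem (no host-side block allocation, little metadata churn)
fallocate -l 10G /srv/guests/guest0.img

# either hand the image to a VM as its virtual disk, or loop-mount it:
mkfs.xfs -q /srv/guests/guest0.img
mkdir -p /mnt/guest0
mount -o loop /srv/guests/guest0.img /mnt/guest0
```

Because the host sees only data writes into an already-allocated file, the cgroup's per-inode writeback accounting attributes essentially all of the guest's IO correctly.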
Thanks.
--
tejun
end of thread, other threads:[~2015-12-03 15:38 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-23 11:05 Does XFS support cgroup writeback limiting? Lutz Vieweg
2015-11-23 20:26 ` Dave Chinner
2015-11-23 22:08 ` Lutz Vieweg
2015-11-23 23:20 ` Dave Chinner
2015-11-25 18:28 ` Lutz Vieweg
2015-11-25 21:35 ` Dave Chinner
2015-11-29 21:41 ` Lutz Vieweg
2015-11-30 23:44 ` Dave Chinner
2015-12-01 8:38 ` automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?) Martin Steigerwald
2015-12-01 16:38 ` Tejun Heo
2015-12-03 0:18 ` automatic testing of cgroup writeback limiting Lutz Vieweg
2015-12-03 15:38 ` Tejun Heo
2015-12-01 11:01 ` I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?) L.A. Walsh
2015-12-01 20:18 ` Dave Chinner