* Does XFS support cgroup writeback limiting?
  2015-11-23 11:05 UTC
From: Lutz Vieweg
To: xfs

Hi,

in June 2015 the article https://lwn.net/Articles/648292/ mentioned
upcoming support for limiting the quantity of buffered writes using
control groups.

Back then, only ext4 was said to support that feature, with other
filesystems requiring some minor changes to do the same.

The generic cgroup writeback support made it into mainline linux-4.2.
I tried to find out whether other filesystems had been adapted in the
meantime, but couldn't find that piece of information.

Therefore I'd like to ask: does XFS (as of linux-4.3 or linux-4.4)
support limiting the quantity of buffered writes using control groups?

Regards,

Lutz Vieweg

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: Does XFS support cgroup writeback limiting?
  2015-11-23 20:26 UTC
From: Dave Chinner
To: Lutz Vieweg
Cc: xfs

On Mon, Nov 23, 2015 at 12:05:53PM +0100, Lutz Vieweg wrote:
> Hi,
>
> in June 2015 the article https://lwn.net/Articles/648292/ mentioned
> upcoming support for limiting the quantity of buffered writes
> using control groups.
>
> Back then, only ext4 was said to support that feature, with other
> filesystems requiring some minor changes to do the same.

Yes, changing the kernel code to support this functionality is about
3 lines of code. However....

> The generic cgroup writeback support made it into mainline
> linux-4.2, I tried to find information on whether other filesystems
> had meanwhile been adapted, but couldn't find this piece of information.
>
> Therefore I'd like to ask: Does XFS (as of linux-4.3 or linux-4.4)
> support limiting the quantity of buffered writes using control groups?

.... I haven't added support to XFS because I have no way of
verifying that the functionality works and that it continues to work
as intended. I.e. we have no regression test coverage for
cgroup-aware writeback, and until someone writes a set of regression
tests that validate that its functionality works correctly, it will
remain this way.

Writing code is trivial. Validating that the code actually works as
intended and doesn't silently get broken in the future is the hard
part....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Does XFS support cgroup writeback limiting?
  2015-11-23 22:08 UTC
From: Lutz Vieweg
To: Dave Chinner
Cc: xfs

On 11/23/2015 09:26 PM, Dave Chinner wrote:
> On Mon, Nov 23, 2015 at 12:05:53PM +0100, Lutz Vieweg wrote:
>> in June 2015 the article https://lwn.net/Articles/648292/ mentioned
>> upcoming support for limiting the quantity of buffered writes
>> using control groups.
>>
>> Back then, only ext4 was said to support that feature, with other
>> filesystems requiring some minor changes to do the same.
>
> Yes, changing the kernel code to support this functionality is about
> 3 lines of code.

Oh, I didn't expect it to be such a small change :-)

> .... I haven't added support to XFS because I have no way of
> verifying the functionality works and that it continues to work as
> it is intended. i.e. we have no regression test coverage for cgroup
> aware writeback and until someone writes a set of regression tests
> that validate its functionality works correctly it will remain this
> way.
>
> Writing code is trivial. Validating the code actually works as
> intended and doesn't silently get broken in the future is the
> hard part....

Understood. Would you nevertheless be willing to publish such a
three-line patch (outside of official releases) for those daredevils
(like me :-)) who'd be willing to give it a try?

After all, this functionality is the last piece of the "isolation"
puzzle that is missing from Linux to actually allow fencing off
virtual machines or containers from DOSing each other by using up
all I/O bandwidth...

Regards,

Lutz Vieweg
* Re: Does XFS support cgroup writeback limiting?
  2015-11-23 23:20 UTC
From: Dave Chinner
To: Lutz Vieweg
Cc: xfs

On Mon, Nov 23, 2015 at 11:08:42PM +0100, Lutz Vieweg wrote:
> Understood, would you anyway be willing to publish such a
> three-line-patch (outside of official releases) for those
> daredevils (like me :-)) who'd be willing to give it a try?

Just make the same mods to XFS as the ext4 patch here:

http://www.spinics.net/lists/kernel/msg2014816.html

> After all, this functionality is the last piece of the
> "isolation"-puzzle that is missing from Linux to actually
> allow fencing off virtual machines or containers from DOSing
> each other by using up all I/O bandwidth...

Yes, I know, but no-one seems to care enough about it to provide
regression tests for it.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
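[The ext4 patch linked above boils down to flagging the superblock as
capable of cgroup-aware writeback. The analogous XFS change would
presumably look something like the following untested sketch:
SB_I_CGROUPWB is the flag the generic writeback code checks and the
ext4 patch sets, but the exact hunk placement in xfs_fs_fill_super()
is a guess.]

```diff
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ in xfs_fs_fill_super(), alongside the other superblock setup @@
 	sb->s_export_op = &xfs_export_operations;
+	/* opt in to cgroup-aware writeback (untested sketch) */
+	sb->s_iflags |= SB_I_CGROUPWB;
```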
* Re: Does XFS support cgroup writeback limiting?
  2015-11-25 18:28 UTC
From: Lutz Vieweg
To: Dave Chinner
Cc: xfs

On 11/24/2015 12:20 AM, Dave Chinner wrote:
> Just make the same mods to XFS as the ext4 patch here:
>
> http://www.spinics.net/lists/kernel/msg2014816.html

I read at http://www.spinics.net/lists/kernel/msg2014819.html
about this patch:

> Journal data which is written by jbd2 worker is left alone by
> this patch and will always be written out from the root cgroup.

If the same was done for XFS, wouldn't this mean a malicious process
could still stall other processes' attempts to write to the
filesystem by performing arbitrary amounts of meta-data modifications
in a tight loop?

>> After all, this functionality is the last piece of the
>> "isolation"-puzzle that is missing from Linux to actually
>> allow fencing off virtual machines or containers from DOSing
>> each other by using up all I/O bandwidth...
>
> Yes, I know, but no-one seems to care enough about it to provide
> regression tests for it.

Well, I could give it a try, if a shell script tinkering with control
group parameters (which requires root privileges and could easily
stall the machine) would be considered adequate for the purpose.

I would propose a test to be performed like this:

0) Identify a block device to test on. I guess some artificially
   speed-limited DM device would be best? Set the speed limit to
   X/100 MB per second, with X configurable.

1) Start 4 "good" plus 4 "evil" subprocesses competing for
   write-bandwidth on the block device. Assign the 4 "good" processes
   to two different control groups ("g1", "g2"), and the 4 "evil"
   processes to two further control groups ("e1", "e2"), so 4 control
   groups in total, with 2 tasks each.

2) Create 3 different XFS filesystem instances on the block device:
   one for access by only the "good" processes, one for access by
   only the "evil" processes, and one for shared access by at least
   two "good" and two "evil" processes.

3) Behaviour of the processes:

   "Good" processes will attempt to write a configured amount of data
   (X MB) at 20% of the speed limit of the block device, modifying
   meta-data at a moderate rate (like creating/renaming/deleting
   files every few megabytes written). Half of the "good" processes
   write to their "good-only" filesystem, the other half writes to
   the "shared access" filesystem.

   Half of the "evil" processes will attempt to write as much data as
   possible into open files in a tight endless loop. The other half
   of the "evil" processes will permanently modify meta-data as
   quickly as possible, creating/renaming/deleting lots of files,
   also in a tight endless loop. Half of the "evil" processes writes
   to the "evil-only" filesystem, the other half writes to the
   "shared access" filesystem.

4) Test 1: Configure all 4 control groups to allow for the same
   buffered write rate percentage.

   The test is successful if all "good" processes terminate
   successfully after a time not longer than it would take to write
   10 times X MB to the rate-limited block device.

   All processes are to be killed after termination of all good
   processes, or after some timeout. If the timeout is reached, the
   test is failed.

5) Test 2: Configure "e1" and "e2" to allow for "zero" buffered
   write rate.

   The test is successful if the "good" processes terminate
   successfully after a time not longer than it would take to write
   5 times X MB to the rate-limited block device.

   All processes are to be killed after termination of all good
   processes, or after some timeout. If the timeout is reached, the
   test is failed.

6) Cleanup: unmount test filesystems, remove the rate-limited DM
   device, remove the control groups.

What do you think, could this be a reasonable plan?

Regards,

Lutz Vieweg
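[Steps (0)-(1) above could be set up roughly like this on a cgroup v1
blkio hierarchy - a configuration sketch only, not part of any
existing test. The device numbers, rate, and the g1/g2/e1/e2 group
names are the hypothetical values from the plan; the script bails out
harmlessly where the hierarchy isn't available.]

```shell
#!/bin/sh
# Sketch: create the four throttled control groups from the test plan.
BLKIO=/sys/fs/cgroup/blkio
DEV="253:0"                # hypothetical scratch device major:minor
LIMIT=$((1024 * 1024))     # 1 MB/s buffered-write ceiling per group

if [ "$(id -u)" -ne 0 ] || [ ! -d "$BLKIO" ]; then
    echo "cgroup v1 blkio hierarchy not available; nothing to do"
    exit 0
fi

for grp in g1 g2 e1 e2; do
    mkdir -p "$BLKIO/$grp"
    # per-device throttle: "<major>:<minor> <bytes_per_second>"
    echo "$DEV $LIMIT" > "$BLKIO/$grp/blkio.throttle.write_bps_device"
done

# a worker joins its group before it starts writing, e.g.:
echo $$ > "$BLKIO/g1/cgroup.procs"
```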
* Re: Does XFS support cgroup writeback limiting?
  2015-11-25 21:35 UTC
From: Dave Chinner
To: Lutz Vieweg
Cc: xfs

On Wed, Nov 25, 2015 at 07:28:42PM +0100, Lutz Vieweg wrote:
> I read at http://www.spinics.net/lists/kernel/msg2014819.html
> about this patch:
>
>> Journal data which is written by jbd2 worker is left alone by
>> this patch and will always be written out from the root cgroup.
>
> If the same was done for XFS, wouldn't this mean a malicious
> process could still stall other processes' attempts to write
> to the filesystem by performing arbitrary amounts of meta-data
> modifications in a tight loop?

XFS doesn't have journal driver writeback, so no.

> Well, I could give it a try, if a shell script tinkering with
> control groups parameters (which requires root privileges and
> could easily stall the machine) would be considered adequate for
> the purpose.

xfstests is where such tests need to live. It would need
infrastructure to set up control groups and bandwidth limits...

> I would propose a test to be performed like this:
>
> 0) Identify a block device to test on. I guess some artificially
>    speed-limited DM device would be best?
>    Set the speed limit to X/100 MB per second, with X configurable.

xfstests provides a scratch device that can be used for this.

> 1) Start 4 "good" plus 4 "evil" subprocesses competing for
>    write-bandwidth on the block device. [...]
>
> 2) Create 3 different XFS filesystem instances on the block
>    device, one for access by only the "good" processes,
>    one for access by only the "evil" processes, one for
>    shared access by at least two "good" and two "evil"
>    processes.

Why do you need multiple filesystems? The writeback throttling is
designed to work within a single filesystem...

I was thinking of something similar, but quite simple, using "bound"
and "unbound" (i.e. limited and unlimited) processes. E.g.:

	process 1 is unbound, does large sequential IO
	processes 2-N are bound to 1MB/s, do large sequential IO

Run for several minutes to reach a stable steady-state behaviour. If
processes 2-N do not receive 1MB/s throughput each, then throttling
of the unbound writeback processes is not working.

Combinations of this test using different read/write streams on each
process give multiple tests, and verify that block IO control works
for both read and write IO, not just writeback throttling.

And then other combinations of this sort of test, such as also
binding process 1 to, say, 20MB/s. Repeating the tests can then tell
us whether fast and slow bindings are working correctly, i.e.
checking to ensure that process 1 doesn't exceed its limits and that
all the other streams stay within bounds, too.

> 3) Behaviour of the processes:
>
>    "Good" processes will attempt to write a configured amount
>    of data (X MB) at 20% of the speed limit of the block device,
>    modifying meta-data at a moderate rate (like
>    creating/renaming/deleting files every few megabytes written).
> [...]
>    The other half of the "evil" processes will permanently
>    modify meta-data as quickly as possible, creating/renaming/deleting
>    lots of files, also in a tight endless loop.

Metadata IO is not throttled - it is owned by the filesystem and
hence the root cgroup. There is no point in running tests that do
large amounts of journal/metadata IO, as this will result in
uncontrollable and unpredictable IO patterns and hence will give
unreliable test results.

We want to test that the data bandwidth control algorithms work
appropriately in a controlled, repeatable environment. Throwing all
sorts of uncontrollable IO at the device is a good /stress/ test, but
it is not going to tell us anything useful in terms of correctness or
reliably detect functional regressions.

> 4) Test 1: Configure all 4 control groups to allow for the same
>    buffered write rate percentage.
>
>    The test is successful if all "good processes" terminate
>    successfully after a time not longer than it would take to write
>    10 times X MB to the rate-limited block device.

If we are rate limiting to 1MB/s, then a 10s test is not long enough
to reach steady state. Indeed, it's going to take at least 30s worth
of IO to guarantee that we get writeback occurring for low bandwidth
streams....

I.e. the test needs to run for a period of time and then measure the
throughput of each stream, comparing it against the expected
throughput for the stream, rather than trying to write a fixed
bandwidth....

> 5) Test 2: Configure "e1" and "e2" to allow for "zero" buffered
>    write rate. [...]
>
> 6) Cleanup: unmount test filesystems, remove rate-limited DM
>    device, remove control groups.

Control group cleanup will need to be added to the xfstests
infrastructure, but it handles everything else...

> What do you think, could this be a reasonable plan?

Yes, I think we can pull a reasonable set of baseline tests from an
approach like this.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
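[The "measure, then compare against the expected throughput" step
could be reduced to a trivial helper like the one below - a
hypothetical sketch, not anything from xfstests. Integer KB/s values
keep POSIX sh arithmetic simple, and the 10% tolerance is an
arbitrary choice.]

```shell
#!/bin/sh
# within_tolerance MEASURED EXPECTED PCT
# succeeds iff |MEASURED - EXPECTED| <= EXPECTED * PCT / 100,
# i.e. the stream achieved its configured rate within PCT percent.
within_tolerance() {
    m=$1; e=$2; pct=$3
    d=$((m - e))
    [ "$d" -lt 0 ] && d=$((-d))
    [ $((d * 100)) -le $((e * pct)) ]
}

# e.g. a stream bound to 1024 KB/s that measured 950 KB/s passes:
if within_tolerance 950 1024 10; then
    echo "stream within bounds"
else
    echo "stream out of bounds"
fi
```

The test harness would feed it the per-stream rates observed after
the steady-state run and fail if any bound stream falls outside its
configured rate.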
* Re: Does XFS support cgroup writeback limiting?
  2015-11-29 21:41 UTC
From: Lutz Vieweg
To: Dave Chinner
Cc: xfs

On 11/25/2015 10:35 PM, Dave Chinner wrote:
>> 2) Create 3 different XFS filesystem instances on the block
>>    device, one for access by only the "good" processes,
>>    one for access by only the "evil" processes, one for
>>    shared access by at least two "good" and two "evil"
>>    processes.
>
> Why do you need multiple filesystems? The writeback throttling is
> designed to work within a single filesystem...

Hmm. Previously, I thought that the limiting of buffered writes was
realized by keeping track of the owners of dirty pages, and that
filesystem support was just required to make sure that writing via a
filesystem did not "anonymize" the dirty data. From what I had read
in blkio-controller.txt it seemed evident that limitations would be
accounted for "per block device", not "per filesystem", and options
like

> echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

document how to configure limits per block device.

Now, after reading through the new Writeback section of
blkio-controller.txt again, I am somewhat confused - the text states

> writeback operates on inode basis

and if that means inodes as in "file system inodes", this would
indeed mean limits would be enforced "per filesystem" - and yet there
are no options documented to specify limits for any specific
filesystem.

Does this mean some process writing to a block device (not via a
filesystem) without "O_DIRECT" will dirty buffer pages, but those
will not be limited (as they are neither synchronous nor
via-filesystem writes)? That would mean VMs sharing some (physical or
abstract) block device could not really be isolated regarding their
asynchronous write I/O...

> Metadata IO not throttled - it is owned by the filesystem and hence
> root cgroup.

Ouch. That kind of defeats the purpose of limiting evil processes'
ability to DOS other processes.

Wouldn't it be possible to assign some arbitrary cost to meta-data
operations - like "account one page write for each meta-data change
to the originating process of that change"? While certainly not
allowing for byte-precise limits of write bandwidth, this would
regain the ability to defend against DOS situations, and for
well-behaved processes the "cost" accounted for their not-so-frequent
meta-data operations would probably not really hurt their writing
speed.

>> The test is successful if all "good processes" terminate successfully
>> after a time not longer than it would take to write 10 times X MB to the
>> rate-limited block device.
>
> if we are rate limiting to 1MB/s, then a 10s test is not long enough
> to reach steady state. Indeed, it's going to take at least 30s worth
> of IO to guarantee that we getting writeback occurring for low
> bandwidth streams....

Sure, the "X/100 MB per second" throttle on the scratch device was
meant to result in a minimal test time of > 100s.

> i.e. the test needs to run for a period of time and then measure
> the throughput of each stream, comparing it against the expected
> throughput for the stream, rather than trying to write a fixed
> bandwidth....

The reason why I thought it a good idea to have the "good" processes
use only a limited write rate was to make sure that the actual write
activity of those processes is spread out over enough time that they
could, after all, feel some "pressure back" from the operating system
- pressure that is applied only after the "bad" processes have filled
up all RAM dedicated to the dirty buffer cache.

Assume the test instance has lots of memory and would be willing to
spend many gigabytes of RAM on dirty buffer caches. Chances are that
in such a situation the "good" processes might be done writing their
limited amount of data almost instantaneously, because the data just
went to RAM.

(I understand that if one used the absolute "blkio.throttle.write*"
options, pressure back could apply before the dirty buffer cache was
maxed out, but in real-world scenarios people will almost always use
the relative "blkio.weight" based limiting - after all, you usually
don't want to throttle processes if there is plenty of bandwidth left
that no other process wants at the same time.)

Regards,

Lutz Vieweg
* Re: Does XFS support cgroup writeback limiting?
  2015-11-30 23:44 UTC
From: Dave Chinner
To: Lutz Vieweg
Cc: xfs

On Sun, Nov 29, 2015 at 10:41:13PM +0100, Lutz Vieweg wrote:
> Hmm. Previously, I thought that the limiting of buffered writes
> was realized by keeping track of the owners of dirty pages, and
> that filesystem support was just required to make sure that writing
> via a filesystem did not "anonymize" the dirty data. [...]
>
> Does this mean some process writing to a block device (not via filesystem)
> without "O_DIRECT" will dirty buffer pages, but those will not be limited
> (as they are neither synchronous nor via-filesystem writes)?
> That would mean VMs sharing some (physical or abstract) block device could
> not really be isolated regarding their asynchronous write I/O...

You are asking the wrong person - I don't know how this is all
supposed to work, how it's supposed to be configured, how different
cgroup controllers are supposed to interact, etc. Hence my request
for regression tests before we say "XFS supports ....", because
without them I have no idea whether something is desired/correct
behaviour or not...

>> Metadata IO not throttled - it is owned by the filesystem and hence
>> root cgroup.
>
> Ouch. That kind of defeats the purpose of limiting evil processes'
> ability to DOS other processes.
> Wouldn't it be possible to assign some arbitrary cost to meta-data
> operations - like "account one page write for each meta-data change

No. Think of a file with millions of extents. Just reading a byte of
data will require pulling the entire extent map into memory, and so
doing hundreds of megabytes of IO and using that much memory.

> to the originating process of that change"? While certainly not
> allowing for limiting to byte-precise limits of write bandwidth,
> this would regain the ability to defend against DOS situations,

No, that won't help at all. In fact, it might even introduce new DOS
situations where we block a global metadata operation because it
doesn't have reservation space, and so *everything* stops until that
metadata IO is dispatched and completed...

> Assume the test instance has lots of memory and would be willing to
> spend many Gigabytes of RAM for dirty buffer caches.

Most people will be running the tests on machines with limited RAM
and disk space, so the tests really cannot depend on having many
multiple gigabytes of RAM available for correct operation....

Indeed, limiting dirty memory thresholds will be part of setting up a
predictable, reliable test scenario (e.g. via
/proc/sys/vm/dirty_bytes and friends).

> (I understand that if one used the absolute "blkio.throttle.write*" options
> pressure back could apply before the dirty buffer cache was maxed out,
> but in real-world scenarios people will almost always use the relative
> "blkio.weight" based limiting, after all, you usually don't want to throttle
> processes if there is plenty of bandwidth left no other process wants
> at the same time.)

Again, I have no idea how the throttling works or is configured, so
we need regression tests to cover both (all?) of these sorts of
common configuration scenarios.

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
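[Pinning down the dirty memory thresholds Dave mentions could look
like this - a configuration sketch only; the 64/16 MB figures are
arbitrary example values, not anything the test plan prescribes. Note
that setting the *_bytes knobs zeroes the corresponding *_ratio
knobs, so the original ratios would need saving for cleanup.]

```shell
#!/bin/sh
# Sketch: fix dirty memory thresholds for a repeatable writeback test.
if [ "$(id -u)" -ne 0 ]; then
    echo "needs root to adjust vm sysctls; nothing to do"
    exit 0
fi

# start background writeback early, block dirtiers at a hard cap
echo $((16 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
echo $((64 * 1024 * 1024)) > /proc/sys/vm/dirty_bytes
```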
* automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?)
  2015-12-01 8:38 UTC
From: Martin Steigerwald
To: xfs, Tejun Heo
Cc: linux-fsdevel, Lutz Vieweg

I think it makes sense to include those who wrote the cgroup
writeback limiting in the discussion of how to automatically test
this feature. Thus adding fsdevel and Tejun to CC.

On Sunday, 29 November 2015, 22:41:13 CET, Lutz Vieweg wrote:
> On 11/25/2015 10:35 PM, Dave Chinner wrote:
> > Why do you need multiple filesystems? The writeback throttling is
> > designed to work within a single filesystem...
>
> Hmm. Previously, I thought that the limiting of buffered writes
> was realized by keeping track of the owners of dirty pages, and
> that filesystem support was just required to make sure that writing
> via a filesystem did not "anonymize" the dirty data. From what
> I had read in blkio-controller.txt it seemed evident that limitations
> would be accounted for "per block device", not "per filesystem", and
> options like
>
> > echo "<major>:<minor> <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device
>
> document how to configure limits per block device.
>
> Now after reading through the new Writeback section of
> blkio-controller.txt again I am somewhat confused - the text states
>
> > writeback operates on inode basis
>
> and if that means inodes as in "file system inodes", this would
> indeed mean limits would be enforced "per filesystem" - and yet
> there are no options documented to specify limits for any specific
> filesystem.
>
> [remainder of Lutz's message of 29 November quoted in full in the
> original; trimmed here]
* Re: automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?)
  2015-12-01  8:38 ` automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?) Martin Steigerwald
@ 2015-12-01 16:38 ` Tejun Heo
  2015-12-03  0:18   ` automatic testing of cgroup writeback limiting Lutz Vieweg
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2015-12-01 16:38 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: linux-fsdevel, Lutz Vieweg, xfs

Hello,

On Tue, Dec 01, 2015 at 09:38:03AM +0100, Martin Steigerwald wrote:
> > > echo "<major>:<minor> <rate_bytes_per_second>" >
> > > /cgrp/blkio.throttle.read_bps_device
> > document how to configure limits per block device.
> >
> > Now after reading through the new Writeback section of blkio-controller.txt
> > again I am somewhat confused - the text states
> >
> > > writeback operates on inode basis

As opposed to pages.  cgroup ownership is tracked per inode, not per
page, so if multiple cgroups write to the same inode at the same time,
some IOs will be incorrectly attributed.

> > and if that means inodes as in "file system inodes", this would
> > indeed mean limits would be enforced "per filesystem" - and yet
> > there are no options documented to specify limits for any specific
> > filesystem.

cgroup ownership is per-inode.  IO throttling is per-device, so as
long as multiple filesystems map to the same device, they fall under
the same limit.

> > > Metadata IO not throttled - it is owned by the filesystem and hence
> > > root cgroup.
> >
> > Ouch. That kind of defeats the purpose of limiting evil processes'
> > ability to DOS other processes.

cgroup isn't a security mechanism and has to make active tradeoffs
between isolation and overhead.  It doesn't provide protection against
malicious users and in general it's a pretty bad idea to depend on
cgroup for protection against hostile entities.  Although some
controllers do better isolation than others, given how filesystems are
implemented, filesystem io control getting there will likely take a
while.

> > Wouldn't it be possible to assign some arbitrary cost to meta-data
> > operations - like "account one page write for each meta-data change
> > to the originating process of that change"? While certainly not
> > allowing for limiting to byte-precise limits of write bandwidth,
> > this would regain the ability to defend against DOS situations,
> > and for well-behaved processes, the "cost" accounted for their
> > not-so-frequent meta-data operations would probably not really hurt
> > their writing speed.

For aggregate consumers, this sort of approach does make sense -
measure total consumption by common operations and distribute the
charges afterwards; however, this will require quite a bit of work on
both io controller and filesystem sides.

> > (I understand that if one used the absolute "blkio.throttle.write*" options
> > pressure back could apply before the dirty buffer cache was maxed out,
> > but in real-world scenarios people will almost always use the relative
> > "blkio.weight" based limiting, after all, you usually don't want to throttle
> > processes if there is plenty of bandwidth left no other process wants at
> > the same time.)

I'd recommend configuring both memory.high and io.weight so that the
buffer area isn't crazy high compared to io bandwidth.  It should be
able to reach the configured ratio that way and also avoids two io
domains competing in the same io domain which can skew the results.

Thanks.

-- 
tejun
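Tejun's recommendation above maps onto the cgroup v2 (unified hierarchy) interface roughly as follows. This is a hedged sketch only: the mount point, group name, and values are placeholders, and the io controller knobs were still settling at the time:

```sh
# Enable the memory and io controllers for child groups, then create
# a group for the workload (assumes cgroup v2 mounted at /sys/fs/cgroup).
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/workload

# Keep the dirty-buffer headroom modest relative to the io bandwidth,
# so writeback pressure is felt before gigabytes of dirty pages pile up.
echo 512M > /sys/fs/cgroup/workload/memory.high

# Proportional io weight for the group (default is 100).
echo "default 100" > /sys/fs/cgroup/workload/io.weight

# Move the writer into the group; pages it dirties, and their
# writeback, are then attributed to this cgroup.
echo "$WRITER_PID" > /sys/fs/cgroup/workload/cgroup.procs
```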
* Re: automatic testing of cgroup writeback limiting
  2015-12-01 16:38 ` Tejun Heo
@ 2015-12-03  0:18 ` Lutz Vieweg
  2015-12-03 15:38   ` Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Lutz Vieweg @ 2015-12-03 0:18 UTC (permalink / raw)
To: Tejun Heo, Martin Steigerwald; +Cc: linux-fsdevel, xfs

On 12/01/2015 05:38 PM, Tejun Heo wrote:
> As opposed to pages. cgroup ownership is tracked per inode, not per
> page, so if multiple cgroups write to the same inode at the same time,
> some IOs will be incorrectly attributed.

I can't think of use cases where this could become a problem.
If more than one user/container/VM is allowed to write to the
same file at any one time, isolation is probably absent anyway ;-)

> cgroup ownership is per-inode. IO throttling is per-device, so as
> long as multiple filesystems map to the same device, they fall under
> the same limit.

Good, that's why I assumed it useful to include a scenario with more
than one filesystem on the same device in the test, just to know
whether there are unexpected issues if more than one filesystem
utilizes the same underlying device.

>>>> Metadata IO not throttled - it is owned by the filesystem and hence
>>>> root cgroup.
>>>
>>> Ouch. That kind of defeats the purpose of limiting evil processes'
>>> ability to DOS other processes.
>
> cgroup isn't a security mechanism and has to make active tradeoffs
> between isolation and overhead. It doesn't provide protection against
> malicious users and in general it's a pretty bad idea to depend on
> cgroup for protection against hostile entities.

I wrote of "evil" processes for simplicity, but 99 out of 100 times
it's not intentional "evilness" that makes a process exhaust the I/O
bandwidth of some device shared with other users/containers/VMs; it's
usually just bugs, inconsiderate programming or inappropriate use
that makes one process write like crazy, making other
users/containers/VMs suffer.

Wherever strict service level guarantees are relevant and
applications require writing to storage, you currently cannot
consolidate two or more applications onto the same physical host,
even if they run under separate users/containers/VMs.

I understand there is no short or medium term solution that
would allow isolating processes that write to the same filesystem
(because of the metadata writing), but is it correct to say
that at least VMs - which do not allow the virtual guest to
cause extensive metadata writes on the physical host, only
writes into pre-allocated image files - can be safely isolated
by the new "buffered write accounting"?

If so, we'd have to stay away from user or container based isolation
of independently SLA'd applications, but could at least resort to
VMs using image files on a shared filesystem.

Regards,

Lutz Vieweg
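The pre-allocated image file setup referred to above could be created as sketched below (paths are hypothetical). One caveat worth hedging: preallocation typically leaves unwritten extents, so the first write to each region still triggers a one-time extent-conversion metadata update before the image settles into pure in-place data writes:

```sh
# Fully preallocate a raw VM image on the shared filesystem, so guest
# writes land in already-allocated blocks instead of allocating new ones.
fallocate -l 20G /vmstore/guest1.img

# Equivalent via qemu-img; preallocation=full writes real zeros
# (slower to create, but avoids unwritten-extent conversion later).
qemu-img create -f raw -o preallocation=full /vmstore/guest2.img 20G
```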
* Re: automatic testing of cgroup writeback limiting
  2015-12-03  0:18 ` automatic testing of cgroup writeback limiting Lutz Vieweg
@ 2015-12-03 15:38 ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2015-12-03 15:38 UTC (permalink / raw)
To: Lutz Vieweg; +Cc: linux-fsdevel, xfs

Hello, Lutz.

On Thu, Dec 03, 2015 at 01:18:48AM +0100, Lutz Vieweg wrote:
> On 12/01/2015 05:38 PM, Tejun Heo wrote:
> >As opposed to pages.  cgroup ownership is tracked per inode, not per
> >page, so if multiple cgroups write to the same inode at the same time,
> >some IOs will be incorrectly attributed.
>
> I can't think of use cases where this could become a problem.
> If more than one user/container/VM is allowed to write to the
> same file at any one time, isolation is probably absent anyway ;-)

Yeap, that's why the trade-off was made.

> >cgroup ownership is per-inode.  IO throttling is per-device, so as
> >long as multiple filesystems map to the same device, they fall under
> >the same limit.
>
> Good, that's why I assumed it useful to include a scenario with more
> than one filesystem on the same device into the test scenario, just
> to know whether there are unexpected issues if more than one filesystem
> utilizes the same underlying device.

Sure, I'd recommend including multiple writers on a single filesystem
case too as that exposes entanglement in metadata handling.  That
should expose problems in more places.

> I wrote of "evil" processes for simplicity, but 99 out of 100 times
> it's not intentional "evilness" that makes a process exhaust I/O
> bandwidth of some device shared with other users/containers/VMs, it's
> usually just bugs, inconsiderate programming or inappropriate use
> that makes one process write like crazy, making other
> users/containers/VMs suffer.

Right now, what cgroup writeback can control is well-behaving
workloads which aren't dominated by metadata writeback.  We still have
ways to go but it still is a huge leap compared to what we had before.

> Whereever strict service level guarantees are relevant, and
> applications require writing to storage, you currently cannot
> consolidate two or more applications onto the same physical host,
> even if they run under separate users/containers/VMs.

You're right.  It can't do isolation well enough for things like
strict service level guarantee.

> I understand there is no short or medium term solution that
> would allow to isolate processes writing to the same filesytem
> (because of the meta data writing), but is it correct to say
> that at least VMs, which do not allow the virtual guest to
> cause extensive meta data writes on the physical host, only
> writes into pre-allocated image files, can be safely isolated
> by the new "buffered write accounting"?

Sure, that or loop mounts.  Pure data accesses should be fairly well
isolated.

Thanks.

-- 
tejun
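The loop-mount alternative Tejun mentions might be set up as below (paths are examples). The idea, as suggested here, is that the inner filesystem's metadata IO reaches the host filesystem as ordinary data IO against the backing file, so it remains attributable to the owning workload rather than landing in the host's root cgroup:

```sh
# Back a container's filesystem with a preallocated file on the host.
fallocate -l 10G /srv/containers/c1.img

# Attach it to a free loop device, make a filesystem, and mount it.
LOOPDEV=$(losetup --find --show /srv/containers/c1.img)
mkfs.xfs "$LOOPDEV"
mkdir -p /srv/containers/c1
mount "$LOOPDEV" /srv/containers/c1
```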
* I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?)
  2015-11-25 21:35 ` Dave Chinner
  2015-11-29 21:41 ` Lutz Vieweg
@ 2015-12-01 11:01 ` L.A. Walsh
  2015-12-01 20:18   ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: L.A. Walsh @ 2015-12-01 11:01 UTC (permalink / raw)
To: Dave Chinner, xfs

Dave Chinner wrote:
> Metadata IO not throttled - it is owned by the filesystem and hence
> root cgroup.
----
Please forgive me if this is "obvious"....

If it is owned by the file system, then logically, such metadata
wouldn't count against the user's quota limits? That 'could' be
justified on the basis that the format of the metadata is outside
the control of the file owner, and depending on what's supported,
included or logged in metadata, the size could vary greatly -- all
outside of the direct control of the user.

However, by adding extended attributes, the user can, likely, affect
the disk space used by the metadata regardless of how it is
implemented and stored on disk -- so in that respect it is user
data. Similarly, system CPU time is attributed to the user causing
the CPU time to be spent, and so on.

To highlight the type of problems this can cause: starting after
Windows XP, MS decided to combine network file I/O requests over 64K
and dispatch them to the 'System' daemon, to optimize network I/O
from multiple processes using bigger windows to try to keep the
network pipe as full as possible (competing, of course, with QOS for
conforming apps). This created a problem -- applications that flood
the network sit mostly idle waiting for their RPC call to finish,
and all their BW is attributed to System. FWIW -- you cannot change
any of the priorities for System without it going all blue screen on
you, telling you that it's crashing to protect you -- because
something changed one of its priority numbers (IO/CPU/memory...etc).

The problem is that any intensive prioritizing slows everything down
and the entire computer becomes unresponsive (10Gb link) as network
requests are reduced in size. Usually it is possible to restore
normality by restarting Explorer (assuming you have enough CPU time
to do so). About 10% of the time, pulling the network cord is
required to stop the storm, and VERY rarely, <.01% of the time,
power-cycling is required.

Attributing user-generated I/O to system processes that are
"unaccountable" *can* and does cause DoS "opportunities"....

Is this metadata really I/O that is completely disconnected from
a user, such that they cannot affect it?
* Re: I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?)
  2015-12-01 11:01 ` I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?) L.A. Walsh
@ 2015-12-01 20:18 ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2015-12-01 20:18 UTC (permalink / raw)
To: L.A. Walsh; +Cc: xfs

On Tue, Dec 01, 2015 at 03:01:23AM -0800, L.A. Walsh wrote:
>
> Dave Chinner wrote:
> >Metadata IO not throttled - it is owned by the filesystem and hence
> >root cgroup.
> ----
> Please forgive me if this is "obvious"....
>
> If it is owned by the file system, then logically, such meta
> data wouldn't count against the user's quota limits?

Filesystem space usage accounting is not the same thing. It's just
accounting, and has nothing to do with how the metadata is dependent
on the transaction subsystem, caching, locking, parent/child
relationships with other metadata in the filesystem, etc.

> Is this metadata really I/O that is completely disconnected from
> a user that they cannot affect?

Throttling metadata IO - read or write - can lead to entire
filesystem stalls/deadlocks. e.g. transaction reservation cannot
complete because the inode that pins the tail of the log is blocked
from being written due to the process that needs to write it being
throttled by the IO controller. Hence every transaction in the
filesystem stalls until the IO controller allows the critical
metadata IO to be dispatched and completed...

Or, alternatively, a process is trying to allocate a block, and
holds an AG locked. It then tries to read a btree block, which is
throttled and blocked for some time. Any other allocation that needs
to take place in that AG is now blocked until the first allocation
is completed.

i.e. the moment we start throttling global metadata IO based on
per-process/cgroup limits, we end up with priority inversions and
partial/complete fs stalls all over the place. You can educate/ban
stupid users, but we can't easily prevent stalls due to IO level
priority inversions we have no control over at the filesystem
level...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2015-12-03 15:38 UTC | newest]

Thread overview: 14+ messages -- links below jump to the message on this page:
2015-11-23 11:05 Does XFS support cgroup writeback limiting? Lutz Vieweg
2015-11-23 20:26 ` Dave Chinner
2015-11-23 22:08 ` Lutz Vieweg
2015-11-23 23:20 ` Dave Chinner
2015-11-25 18:28 ` Lutz Vieweg
2015-11-25 21:35 ` Dave Chinner
2015-11-29 21:41 ` Lutz Vieweg
2015-11-30 23:44 ` Dave Chinner
2015-12-01  8:38 ` automatic testing of cgroup writeback limiting (was: Re: Does XFS support cgroup writeback limiting?) Martin Steigerwald
2015-12-01 16:38 ` Tejun Heo
2015-12-03  0:18 ` automatic testing of cgroup writeback limiting Lutz Vieweg
2015-12-03 15:38 ` Tejun Heo
2015-12-01 11:01 ` I/O 'owner' DoS probs (was Re: Does XFS support cgroup writeback limiting?) L.A. Walsh
2015-12-01 20:18 ` Dave Chinner