* Bandwidth Allocations under CFQ I/O Scheduler @ 2006-10-16 20:46 Phetteplace, Thad (GE Healthcare, consultant) 2006-10-17 1:24 ` Arjan van de Ven 0 siblings, 1 reply; 26+ messages in thread From: Phetteplace, Thad (GE Healthcare, consultant) @ 2006-10-16 20:46 UTC (permalink / raw) To: linux-kernel The I/O priority levels available under the CFQ scheduler are nice (no pun intended), but I remember some talk back when they first went in that future versions might include bandwidth allocations in addition to the 'niceness' style. Is anyone out there working on that? If not, I'm willing to hack up a proof of concept... I just want to make sure I'm not reinventing the wheel. Thanks, Thad Phetteplace ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-16 20:46 Bandwidth Allocations under CFQ I/O Scheduler Phetteplace, Thad (GE Healthcare, consultant) @ 2006-10-17 1:24 ` Arjan van de Ven 2006-10-17 13:23 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Arjan van de Ven @ 2006-10-17 1:24 UTC (permalink / raw) To: Phetteplace, Thad (GE Healthcare, consultant); +Cc: linux-kernel On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare, consultant) wrote: > The I/O priority levels available under the CFQ scheduler are > nice (no pun in intended), but I remember some talk back when > they first went in that future versions might include bandwidth > allocations in addition to the 'niceness' style. Is anyone out > there working on that? If not, I'm willing to hack up a proof > of concept... I just wan't to make sure I'm not reinventing > the wheel. Hi, it's a nice idea in theory. However... since IO bandwidth for seeks is about 1% to 3% of that of sequential IO (on disks at least), which bandwidth do you want to allocate? "worst case" you need to use the all-seeks bandwidth, but that's so far away from "best case" that it may well not be relevant in practice. Yet there are real world cases where for a period of time you approach worst case behavior ;( Greetings, Arjan van de Ven -- if you want to mail me at work (you don't), use arjan (at) linux.intel.com Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org ^ permalink raw reply [flat|nested] 26+ messages in thread
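The 1% to 3% gap Arjan describes falls out of simple seek arithmetic; a sketch with made-up but period-plausible disk figures (the numbers are illustrative assumptions, not measurements):

```python
# Back-of-envelope for the seek-vs-sequential gap. All constants are
# assumed values for a mid-2000s disk, chosen only for illustration.
SEQ_MBPS = 60.0   # sequential throughput, MiB/s
SEEK_MS = 8.0     # average seek + rotational latency, ms
REQ_KB = 4.0      # request size for a fully random workload, KiB

def random_bandwidth_mbps(seek_ms=SEEK_MS, req_kb=REQ_KB):
    """Bandwidth when every request pays a full seek."""
    iops = 1000.0 / seek_ms            # requests serviced per second
    return iops * req_kb / 1024.0      # MiB/s delivered

worst = random_bandwidth_mbps()
print(f"all-seeks: {worst:.2f} MiB/s, {worst / SEQ_MBPS:.1%} of sequential")
```

With these figures the all-seeks case delivers roughly half a MiB/s, i.e. under 1% of the sequential rate, which is why reserving "worst case" bandwidth is so far from "best case".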
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-17 1:24 ` Arjan van de Ven @ 2006-10-17 13:23 ` Jens Axboe 2006-10-17 14:37 ` Ric Wheeler ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Jens Axboe @ 2006-10-17 13:23 UTC (permalink / raw) To: Arjan van de Ven Cc: Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Tue, Oct 17 2006, Arjan van de Ven wrote: > On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare, > consultant) wrote: > > The I/O priority levels available under the CFQ scheduler are > > nice (no pun in intended), but I remember some talk back when > > they first went in that future versions might include bandwidth > > allocations in addition to the 'niceness' style. Is anyone out > > there working on that? If not, I'm willing to hack up a proof > > of concept... I just wan't to make sure I'm not reinventing > > the wheel. > > > Hi, > > it's a nice idea in theory. However... since IO bandwidth for seeks is > about 1% to 3% of that of sequential IO (on disks at least), which > bandwidth do you want to allocate? "worst case" you need to use the > all-seeks bandwidth, but that's so far away from "best case" that it may > well not be relevant in practice. Yet there are real world cases where > for a period of time you approach worst case behavior ;( Bandwidth reservation would have to be confined to special cases, you obviously cannot do it "in general" for the reasons Arjan lists above. So you absolutely have to limit any meta data io that would cause seeks, and the file in question would have to be laid out in a closely sequential fashion. As long as the access pattern generated by the app asking for reservation is largely sequential, the kernel can do whatever it needs to help you maintain the required bandwidth. On a per-file basis the bandwidth reservation should be doable, to the extent that generic hardware allows. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-17 13:23 ` Jens Axboe @ 2006-10-17 14:37 ` Ric Wheeler 2006-10-17 14:47 ` Jens Axboe 2006-10-17 14:46 ` Phetteplace, Thad (GE Healthcare, consultant) 2006-10-18 8:00 ` Jakob Oestergaard 2 siblings, 1 reply; 26+ messages in thread From: Ric Wheeler @ 2006-10-17 14:37 UTC (permalink / raw) To: Jens Axboe Cc: Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel Jens Axboe wrote: > On Tue, Oct 17 2006, Arjan van de Ven wrote: > >>On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare, >>consultant) wrote: >> >>>The I/O priority levels available under the CFQ scheduler are >>>nice (no pun in intended), but I remember some talk back when >>>they first went in that future versions might include bandwidth >>>allocations in addition to the 'niceness' style. Is anyone out >>>there working on that? If not, I'm willing to hack up a proof >>>of concept... I just wan't to make sure I'm not reinventing >>>the wheel. >> >> >>Hi, >> >>it's a nice idea in theory. However... since IO bandwidth for seeks is >>about 1% to 3% of that of sequential IO (on disks at least), which >>bandwidth do you want to allocate? "worst case" you need to use the >>all-seeks bandwidth, but that's so far away from "best case" that it may >>well not be relevant in practice. Yet there are real world cases where >>for a period of time you approach worst case behavior ;( > > > Bandwidth reservation would have to be confined to special cases, you > obviously cannot do it "in general" for the reasons Arjan lists above. > So you absolutely have to limit any meta data io that would cause seeks, > and the file in question would have to be laid out in a closely > sequential fashion. As long as the access pattern generated by the app > asking for reservation is largely sequential, the kernel can do whatever > it needs to help you maintain the required bandwidth. 
> > On a per-file basis the bandwidth reservation should be doable, to the > extent that generic hardware allows. I agree - bandwidth allocation is really tricky to do in a useful way. On one hand, you could "time slice" the disk with some large quanta as we would do with a CPU to get some reasonably useful allocation for competing, streaming workloads. On the other hand, this kind of thing would kill latency if/when you hit any synchronous writes (or cold reads). One other possible use for allocation is throttling a background workload (say, an iterative checker for a file system or some such thing) where the workload can run effectively forever, but should be contained to not interfere with foreground workloads. A similar time slice might be used to throttle this load down unless there is no competing work to be done. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-17 14:37 ` Ric Wheeler @ 2006-10-17 14:47 ` Jens Axboe 0 siblings, 0 replies; 26+ messages in thread From: Jens Axboe @ 2006-10-17 14:47 UTC (permalink / raw) To: Ric Wheeler Cc: Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Tue, Oct 17 2006, Ric Wheeler wrote: > Jens Axboe wrote: > >On Tue, Oct 17 2006, Arjan van de Ven wrote: > > > >>On Mon, 2006-10-16 at 16:46 -0400, Phetteplace, Thad (GE Healthcare, > >>consultant) wrote: > >> > >>>The I/O priority levels available under the CFQ scheduler are > >>>nice (no pun in intended), but I remember some talk back when > >>>they first went in that future versions might include bandwidth > >>>allocations in addition to the 'niceness' style. Is anyone out > >>>there working on that? If not, I'm willing to hack up a proof > >>>of concept... I just wan't to make sure I'm not reinventing > >>>the wheel. > >> > >> > >>Hi, > >> > >>it's a nice idea in theory. However... since IO bandwidth for seeks is > >>about 1% to 3% of that of sequential IO (on disks at least), which > >>bandwidth do you want to allocate? "worst case" you need to use the > >>all-seeks bandwidth, but that's so far away from "best case" that it may > >>well not be relevant in practice. Yet there are real world cases where > >>for a period of time you approach worst case behavior ;( > > > > > >Bandwidth reservation would have to be confined to special cases, you > >obviously cannot do it "in general" for the reasons Arjan lists above. > >So you absolutely have to limit any meta data io that would cause seeks, > >and the file in question would have to be laid out in a closely > >sequential fashion. As long as the access pattern generated by the app > >asking for reservation is largely sequential, the kernel can do whatever > >it needs to help you maintain the required bandwidth. 
> > > >On a per-file basis the bandwidth reservation should be doable, to the > >extent that generic hardware allows. > > I agree - bandwidth allocation is really tricky to do in a useful way. > > On one hand, you could "time slice" the disk with some large quanta as > we would do with a CPU to get some reasonably useful allocation for > competing, streaming workloads. > > On the other hand, this kind of thing would kill latency if/when you hit > any synchronous writes (or cold reads). That's pretty close to the way that CFQ already operates. You need time slices long enough to make the initial seek negligible, but short enough to make the latencies nice. A tradeoff, of course. > One other possible use for allocation is throttling a background > workload (say, an interative checker for a file system or some such > thing) where the workload can run effectively forever, but should be > contained to not interfere with foreground workloads. A similar time > slice might be used to throttle this load done unless there is no > competing work to be done. That'd be the idle io class. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
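The idle class Jens refers to is exposed to userspace through ionice(1); a sketch of how a background checker would be confined to it (the scanned path and the PID are placeholders):

```shell
# Run a long-running background scan in CFQ's idle class: it is only
# serviced when no other class has I/O pending on the device.
ionice -c 3 find / -xdev -type f > /dev/null

# Query or change an already-running process (PID 1234 is a placeholder):
ionice -p 1234            # print its I/O scheduling class and priority
ionice -c 2 -n 7 -p 1234  # demote it to best-effort, lowest priority
```

Class 3 is idle; class 2 is best-effort with priorities 0 (highest) through 7 (lowest), which is the "niceness" style the thread started from.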
* RE: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-17 13:23 ` Jens Axboe 2006-10-17 14:37 ` Ric Wheeler @ 2006-10-17 14:46 ` Phetteplace, Thad (GE Healthcare, consultant) 2006-10-18 8:00 ` Jakob Oestergaard 2 siblings, 0 replies; 26+ messages in thread From: Phetteplace, Thad (GE Healthcare, consultant) @ 2006-10-17 14:46 UTC (permalink / raw) To: Jens Axboe, Arjan van de Ven; +Cc: linux-kernel Jens Axboe wrote: > Arjan van de Ven wrote: > > > > it's a nice idea in theory. However... since IO bandwidth for seeks is > > about 1% to 3% of that of sequential IO (on disks at least), which > > bandwidth do you want to allocate? "worst case" you need to use the > > all-seeks bandwidth, but that's so far away from "best case" that it > > may well not be relevant in practice. Yet there are real world cases > > where for a period of time you approach worst case behavior ;( > > Bandwidth reservation would have to be confined to special cases, you > obviously cannot do it "in general" for the reasons Arjan lists above. > So you absolutely have to limit any meta data io that would cause seeks, > and the file in question would have to be laid out in a closely > sequential fashion. As long as the access pattern generated by the app > asking for reservation is largely sequential, the kernel can do whatever > it needs to help you maintain the required bandwidth. > > On a per-file basis the bandwidth reservation should be doable, to the > extent that generic hardware allows. I see bandwidth allocations coming in two flavors: floors and ceilings. Floors (a guaranteed minimum) are indeed problematic because of the danger of over-allocating bandwidth. Seek latency reducing your total available bandwidth in non-deterministic ways only complicates the issue. Ceilings are easier, as we are simply capping utilization even when excess capacity is available. 
Of course floors are probably what most people are thinking of when they talk about allocations, but ceilings have their place also. In an embedded environment where very deterministic behavior is the goal, I/O ceilings could be useful. Also, it could be useful for emulation of legacy hardware performance, perhaps for regression testing or some such (admittedly an edge case). If you over-allocate bandwidth on a resource, the bandwidth allocation would probably fall back to something more like the 'niceness' model (with the higher bandwidth procs running with higher priority). The only real change then is the enforcing of bandwidth ceilings. This is probably not very useful in the general case (your main operating system drives with many users/processes reading and writing), but it can be very useful for managing the behavior of a limited set of apps with exclusive access to a drive. There is a body of knowledge in the ISP/routing world we can draw on here, though they don't have the same latency issues. Later, Thad Phetteplace ^ permalink raw reply [flat|nested] 26+ messages in thread
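A ceiling of the kind described is essentially the token bucket familiar from that ISP/routing world; a minimal userspace sketch, with invented rates and a hypothetical writer loop:

```python
import time

class TokenBucket:
    """Cap throughput at `rate` bytes/sec, allowing bursts up to `burst` bytes."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Block until `nbytes` of budget is available, then spend it."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Hypothetical usage: cap a writer at 1 MiB/s with 256 KiB bursts.
bucket = TokenBucket(rate=1 << 20, burst=1 << 18)
# for chunk in chunks: bucket.consume(len(chunk)); dev.write(chunk)
```

Note this only enforces a ceiling, the "easy" flavor; a floor would additionally require the scheduler to take service away from competitors, which is where the over-allocation problem above comes in.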
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-17 13:23 ` Jens Axboe 2006-10-17 14:37 ` Ric Wheeler 2006-10-17 14:46 ` Phetteplace, Thad (GE Healthcare, consultant) @ 2006-10-18 8:00 ` Jakob Oestergaard 2006-10-18 9:40 ` Arjan van de Ven 2006-10-18 9:51 ` Jens Axboe 2 siblings, 2 replies; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 8:00 UTC (permalink / raw) To: Jens Axboe Cc: Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Tue, Oct 17, 2006 at 03:23:13PM +0200, Jens Axboe wrote: > On Tue, Oct 17 2006, Arjan van de Ven wrote: ... > > Hi, > > > > it's a nice idea in theory. However... since IO bandwidth for seeks is > > about 1% to 3% of that of sequential IO (on disks at least), which > > bandwidth do you want to allocate? "worst case" you need to use the > > all-seeks bandwidth, but that's so far away from "best case" that it may > > well not be relevant in practice. Yet there are real world cases where > > for a period of time you approach worst case behavior ;( > > Bandwidth reservation would have to be confined to special cases, you > obviously cannot do it "in general" for the reasons Arjan lists above. How about allocating I/O operations instead of bandwidth? So, any read is really a seek+read, and we count that as one I/O operation. Same for writes. Since the total "capacity" of the system is typically (in real-world scenarios) the number of operations (seek+X) rather than the raw sequential bandwidth anyway, I suppose that I/O operations would be what you wanted to allocate anyway. Anyway, just a thought... (And if you're thinking one sequential reader/writer could then starve the system; well, count every 256KiB of data to read/write as a separate I/O operation even though no seek is needed. That would very roughly match the raw read/write performance with the seek performance) -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
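The accounting rule in that last parenthesis, charging one operation per started chunk of transfer so a big sequential request cannot starve everyone else, can be written down directly; the 256 KiB quantum is the message's own example figure:

```python
QUANTUM = 256 * 1024  # bytes; the message's example: 256 KiB "costs" one op

def iop_cost(nbytes):
    """Operations charged for one request: one per started 256 KiB chunk.

    A small seek+read costs exactly one op; a long sequential transfer is
    charged as if it were several seeks, roughly matching raw read/write
    performance against seek performance.
    """
    return max(1, -(-nbytes // QUANTUM))  # ceiling division

print(iop_cost(4 * 1024))     # small random read: one op
print(iop_cost(1024 * 1024))  # 1 MiB sequential: charged as four ops
```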
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 8:00 ` Jakob Oestergaard @ 2006-10-18 9:40 ` Arjan van de Ven 2006-10-18 11:30 ` Jakob Oestergaard 2006-10-18 9:51 ` Jens Axboe 1 sibling, 1 reply; 26+ messages in thread From: Arjan van de Ven @ 2006-10-18 9:40 UTC (permalink / raw) To: Jakob Oestergaard Cc: Jens Axboe, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, 2006-10-18 at 10:00 +0200, Jakob Oestergaard wrote: > On Tue, Oct 17, 2006 at 03:23:13PM +0200, Jens Axboe wrote: > > On Tue, Oct 17 2006, Arjan van de Ven wrote: > ... > > > Hi, > > > > > > it's a nice idea in theory. However... since IO bandwidth for seeks is > > > about 1% to 3% of that of sequential IO (on disks at least), which > > > bandwidth do you want to allocate? "worst case" you need to use the > > > all-seeks bandwidth, but that's so far away from "best case" that it may > > > well not be relevant in practice. Yet there are real world cases where > > > for a period of time you approach worst case behavior ;( > > > > Bandwidth reservation would have to be confined to special cases, you > > obviously cannot do it "in general" for the reasons Arjan lists above. > > How about allocating I/O operations instead of bandwidth ? > > So, any read is really a seek+read, and we count that as one I/O > operation. Same for writes. Hi, I can see that that makes it simple, but.. what would it MEAN? Eg what would a system administrator use it for? It then no longer means "my mp3 player is guaranteed to get the streaming mp3 from the disk at this bitrate" or something like that... so my question to you is: can you describe what it'd bring the admin to put such an allocation in place? If we can find that, it can be a good approach... but if not, I'm less certain this'll be used. Greetings, Arjan van de Ven ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 9:40 ` Arjan van de Ven @ 2006-10-18 11:30 ` Jakob Oestergaard 2006-10-18 11:49 ` Jens Axboe 0 siblings, 1 reply; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 11:30 UTC (permalink / raw) To: Arjan van de Ven Cc: Jens Axboe, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18, 2006 at 11:40:56AM +0200, Arjan van de Ven wrote: ... > Hi, > > I can see that that makes it simple, but.. what would it MEAN? Eg what > would a system administrator use it for? For example, I could allocate "at least 100 iops/sec" for my database. The VMWare can take whatever is left. I have no idea how much bandwidth my database needs... But I have a rough idea about how many I/O operations it does for a given operation. And if I don't, strace can tell me pretty quick :) > It then no longer means "my mp3 > player is guaranteed to get the streaming mp3 from the disk at this > bitrate" or something like that... In a sense you are right. You cannot be certain that the mp3 player will get a specific bandwidth. The mp3 player will be accessing the underlying storage through a filesystem, which again means that accessing a file sequentially *will* cause non-sequential I/O on the underlying device(s). If you wanted to guarantee any specific bandwidth, you would somehow assume that you had an infinite (or at least very very high) number of seeks at your disposal. Or that seeks were free... In any other scenario, the total "capacity" of your underlying storage, the maximum amount of bandwidth (including non-free seeks) available, would vary depending on how it is currently used (how many seeks are issued) by all the clients. So, what I'm arguing is; you will not want to specify a fixed sequential bandwidth for your mp3 player. 
What you want to do is this: Allocate 5 iops/sec for your mp3 player because either a quick calculation - or - experience has shown that this is enough for it to keep its buffer from depleting at all times. Describing iops/sec for your mp3 player is at least as simple as sequential bitrate. The difference is that you can implement iops/sec allocation whereas you cannot implement bitrate allocation (in a meaningful way at least) :) > so my question to you is: can you > describe what it'd bring the admin to put such an allocation in place? Limiting on iops/sec rather than bandwidth, is simply accepting that bandwidth does not make sense (because you cannot know how much of it you have and therefore you cannot slice up your total capacity), and, realizing that bandwidth in the scenarios where limiting is interesting is in reality bound by seeks rather than sequential on-disk throughput. > If we find that it can be a good approach.. but if not, I'm less certain > this'll be used.. I can only see a problem with specifying iops/sec in the one scenario where you have multiple sequential readers or writers, and you want to distribute bandwidth between them. However, in that scenario, where you have multiple clients, *seeks* will again be your limiting factor. Specifying iops/sec might be difficult for the admin. But I really can't see how you would implement bandwidth limiting in a meaningful way - and if you can't do that, then specifying bandwidth limiting in terms of a bandwidth limiting process that doesn't work properly will be even harder :) The only situation in which seeks are not either the limiting factor, or at least a very very large contributor to I/O wait, is in the situation where you have only one client. And, if you have only one client, what is it you need sharing of resources for again? In all other scenarios, I believe iops/sec is by far a superior way of describing the resource allocation.
For two reasons: 1) It describes what the hardware provides 2) By describing a concept based on the real world it may actually be possible to implement so that it works as intended I hope some of the above makes sense. I'll try to explain what I mean to the best of my ability :) -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 11:30 ` Jakob Oestergaard @ 2006-10-18 11:49 ` Jens Axboe 2006-10-18 12:23 ` Jakob Oestergaard 0 siblings, 1 reply; 26+ messages in thread From: Jens Axboe @ 2006-10-18 11:49 UTC (permalink / raw) To: Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Jakob Oestergaard wrote: > On Wed, Oct 18, 2006 at 11:40:56AM +0200, Arjan van de Ven wrote: > ... > > Hi, > > > > I can see that that makes it simple, but.. what would it MEAN? Eg what > > would a system administrator use it for? > > For example, I could allocate "at least 100 iops/sec" for my database. > The VMWare can take whatever is left. > > I have no idea how much bandwidth my database needs... But I have a > rough idea about how many I/O operations it does for a given operation. > And if I don't, strace can tell me pretty quick :) That's crazy. So you want a user of this to strace and write a script parsing strace output to tell you possibly how many iops/sec you need? > > It then no longer means "my mp3 > > player is guaranteed to get the streaming mp3 from the disk at this > > bitrate" or something like that... > > In a sense you are right. > > You cannot be certain that the mp3 player will get a specific bandwidth. > The mp3 player will be accessing the underlying storage through a > filesystem, which again means that accessing a file sequentially *will* > cause non-sequential I/O on the underlying device(s). > > If you wanted to guarantee any specific bandwidth, you would somehow > assume that you had an infinite (or at least very very high) number of > seeks at your disposal. Or that seeks were free... In any other > scenario, the total "capacity" of your underlying storage, the maximum > amount of bandwidth (including non-free seeks) available, would vary > depending on how it is currently used (how many seeks are issued) by all > the clients. 
> > So, what I'm arguing is; you will not want to specify a fixed sequential > bandwidth for your mp3 player. > > What you want to do is this: Allocate 5 iops/sec for your mp3 player > because either a quick calculation - or - experience has shown that this > is enough for it to keep its buffer from depleting at all times. But that is the only number that makes sense. To give some sort of soft QOS for bandwidth, you need the file given so the kernel can bring in the meta data (to avoid those seeks) and see how the file is laid out. For the mp3 case, you should not even need to ask the user anything. The player app knows exactly how much bandwidth it needs and what kind of latency; it can tell from the bitrate of the media. What you are arguing for is doing trial and error with a magic iops/sec metric that is both hard to understand and impossible to quantify. > Describing iops/sec for your mp3 player is at least as simple as > sequential bitrate. The difference is, that you can implement iops/sec > allocation whereas you cannot implement bitrate allocation (in a > meaningful way at least) :) > > > > so my question to you is: can you > > describe what it'd bring the admin to put such an allocation in place? > > Limiting on iops/sec rather than bandwidth, is simply accepting that > bandwidth does not make sense (because you cannot know how much of it > you have and therefore you cannot slice up your total capacity), and, > realizing that bandwidth in the scenarios where limiting is interesting > is in reality bound by seeks rather than sequential on-disk throughput. I don't understand your arguments, to be honest. If you can tell the iops/sec rate for a given workload, you can certainly see the bandwidth as well. Both iops/sec and bandwidth will vary wildly depending on the workload(s) on the disk. > > If we find that it can be a good approach.. but if not, I'm less certain > > this'll be used..
> > I can only see a problem with specifying iops/sec in the one scenario > where you have multiple sequential readers or writers, and you want to > distribute bandwidth between them. If you only have one app doing io, you don't need QOS. The thing is, you always have competing apps. Even with only one user space app running, the kernel may still generate io for you. > In all other scenarios, I believe iops/sec is by far a superios way of > describing the ressource allocation. For two reasons: > 1) It describes what the hardware provides > 2) By describing a concept based on the real world it may actually be > possible to implement so that it works as intended Same arguments. You can't universally state that this disk gives you 80MiB/sec, and you can't universally state that this disk gives you 1000 iops/sec. You need to also define the conditions for when it can provide this performance. So if you instead say this disk does 80MiB/sec if read with at least 8KiB blocks from lba 0 to 50000 sequentially. Or you can state the same with iops/sec. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
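Jens's closing point, that a rate is only meaningful together with the conditions under which it is delivered, can be illustrated with the usual service-time model (the 80 MiB/s and 8 ms figures are illustrative assumptions, not specs for any real disk):

```python
def effective_mbps(req_kb, seek_ms=8.0, seq_mbps=80.0, random=True):
    """Delivered throughput = request size / (optional seek + transfer time)."""
    transfer_s = (req_kb / 1024.0) / seq_mbps        # media transfer time
    seek_s = seek_ms / 1000.0 if random else 0.0     # positioning cost
    return (req_kb / 1024.0) / (seek_s + transfer_s)

# The same disk "is" a sub-1 MiB/s device or a ~50 MiB/s device
# depending purely on the access pattern:
for kb in (4, 64, 1024):
    print(f"{kb:5d} KiB random reads: {effective_mbps(kb):6.2f} MiB/s")
```

Only when `random=False` (pure sequential access, no positioning cost) does the nominal sequential rate actually come out, which is exactly the "state the conditions" caveat.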
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 11:49 ` Jens Axboe @ 2006-10-18 12:23 ` Jakob Oestergaard 2006-10-18 12:42 ` Alan Cox 2006-10-18 12:42 ` Jens Axboe 0 siblings, 2 replies; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 12:23 UTC (permalink / raw) To: Jens Axboe Cc: Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18, 2006 at 01:49:14PM +0200, Jens Axboe wrote: > On Wed, Oct 18 2006, Jakob Oestergaard wrote: ... > > I have no idea how much bandwidth my database needs... But I have a > > rough idea about how many I/O operations it does for a given operation. > > And if I don't, strace can tell me pretty quick :) > > That's crazy. So you want a user of this to strace and write a script > parsing strace output to tell you possibly how many iops/sec you need? Come up with something better then, genius :) strace for iops is doable albeit complicated. Determining MiB/sec requirement for sufficient db performance is impossible. > > > > So, what I'm arguing is; you will not want to specify a fixed sequential > > bandwidth for your mp3 player. > > > > What you want to do is this: Allocate 5 iops/sec for your mp3 player > > because either a quick calculation - or - experience has shown that this > > is enough for it to keep its buffer from depleting at all times. > > But that is the only number that makes sense. To give some sort of soft > QOS for bandwidth, you need the file given so the kernel can bring in > the meta data (to avoid those seeks) and see how the file is laid out. Ok I see where you're going. I think it sounds very complicated - for the user and for the kernel. Would you want to limit bandwidth on a per-file or per-process basis? You're talking files above; I was thinking about processes (consumers if you like) the whole time. Have you thought about how this would work in the long run, with many files coming into use?
The kernel can't have the meta-data cached for all files - so the reading-in of metadata would affect the remaining available disk performance... > For the mp3 case, you should not even need to ask the user anything. The > player app knows exactly how much bandwidth it needs and what kind of > latency, if can tell from the bitrate of the media. Agreed. And this holds true for both base metrics, bandwidth or iops/sec. > What you are arguing > for is doing trial and error Sort-of correct. > with a magic iops/sec metric that is both > hard to understand and impossible to quantify. iops/sec is what you get from your disks. In real world scenarios. It's no more magic than the real world, and no harder to understand than real world disks. Although I admit real-world disks can be a bitch at times ;) My argument is that it is simpler to understand than bandwidth. Sure, for the streaming file example bandwidth sounds simple. But how many real-world applications are like that? What about databases? What about web servers? What about mail servers? What about 99% of the real-world applications out there that are not streaming audio or video players? > > Limiting on iops/sec rather than bandwidth, is simply accepting that > > bandwidth does not make sense (because you cannot know how much of it > > you have and therefore you cannot slice up your total capacity), and, > > realizing that bandwidth in the scenarios where limiting is interesting > > is in reality bound by seeks rather than sequential on-disk throughput. > > I don't understand your arguments, to be honest. If you can tell the > iops/sec rate for a given workload, you can certainly see the bandwidth > as well. My thesis is, that for most applications it is not the bandwidth you care about. If I am not right in this, sure, you have a point then. But hey, how many of the applications out there are mp3 players? 
(in other words; please oh please, prove me wrong, I like it :) > Both iops/sec and bandwidth will vary wildly depending on the > workload(s) on the disk. The total iops/sec "available" from a given disk will not vary a lot, compared to how the total bandwidth available from a given disk will vary. ... > > I can only see a problem with specifying iops/sec in the one scenario > > where you have multiple sequential readers or writers, and you want to > > distribute bandwidth between them. > > If you only have one app doing io, you don't need QOS. Precisely! In the *one* case where it is actually possible to implement a QOS system based on bandwidth, you don't need QOS. With more than 1 client, you get seeks, and then bandwidth is no longer a sensible measure. > The thing is, you > always have competing apps. Even with only one user space app running, > the kernel may still generate io for you. Sing it brother, sing it! ;) > > In all other scenarios, I believe iops/sec is by far a superios way of > > describing the ressource allocation. For two reasons: > > 1) It describes what the hardware provides > > 2) By describing a concept based on the real world it may actually be > > possible to implement so that it works as intended > > Same arguments. You can't universally state that this disk gives you > 80MiB/sec, and you can't universally state that this disk gives you 1000 > iops/sec. I agree. But I would be lying a lot less if I made the claim in iops/sec :) They will vary a factor of two or three, depending on their nature. Bandwidth will vary three to five orders of magnitude depending on the nature of the I/O operations issued to the device. > You need to also define the conditions for when it can provide > this performance. So if you instead say this disk does 80MiB/sec if read > with at least 8KiB blocks from lba 0 to 50000 sequentially. Or you can > state the same with iops/sec. Yep. 
However, for the interface to be useful, it needs two things as I see it (and I may well be overlooking something): 1) It needs to be simple to use 2) It needs to do what it claims, "well enough" -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:23 ` Jakob Oestergaard @ 2006-10-18 12:42 ` Alan Cox 2006-10-18 12:44 ` Jens Axboe 2006-10-18 12:44 ` Jakob Oestergaard 2006-10-18 12:42 ` Jens Axboe 1 sibling, 2 replies; 26+ messages in thread From: Alan Cox @ 2006-10-18 12:42 UTC (permalink / raw) To: Jakob Oestergaard Cc: Jens Axboe, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, 2006-10-18 at 14:23 +0200, Jakob Oestergaard wrote: > iops/sec is what you get from your disks. In real world scenarios. It's > no more magic than the real world, and no harder to understand than real > world disks. Although I admit real-world disks can be a bitch at times ;) Even iops/sec is very vague and arbitrary. If your disk happens to be retrying a sector or doing a cleaning pass or any other housekeeping or vibration damping and so on you'll get very different numbers. Bandwidth is completely silly in this context, iops/sec is merely hopeless 8) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:42 ` Alan Cox @ 2006-10-18 12:44 ` Jens Axboe 2006-10-18 12:55 ` Nick Piggin 2006-10-18 12:44 ` Jakob Oestergaard 1 sibling, 1 reply; 26+ messages in thread From: Jens Axboe @ 2006-10-18 12:44 UTC (permalink / raw) To: Alan Cox Cc: Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Alan Cox wrote: > Bandwidth is completely silly in this context, iops/sec is merely > hopeless 8) Both need the disk to play nicely, if you get into error handling or correction, you get screwed. Bandwidth by itself is meaningless, you need latency as well to make sense of it. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:44 ` Jens Axboe @ 2006-10-18 12:55 ` Nick Piggin 2006-10-18 13:04 ` Jens Axboe 2006-10-18 13:37 ` Jakob Oestergaard 0 siblings, 2 replies; 26+ messages in thread From: Nick Piggin @ 2006-10-18 12:55 UTC (permalink / raw) To: Jens Axboe Cc: Alan Cox, Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel Jens Axboe wrote: > On Wed, Oct 18 2006, Alan Cox wrote: > >>Bandwidth is completely silly in this context, iops/sec is merely >>hopeless 8) > > > Both need the disk to play nicely, if you get into error handling or > correction, you get screwed. Bandwidth by itself is meaningless, you > need latency as well to make sense of it. When writing an IO scheduler, I decided `time' was a pretty good metric. That's roughly what we use for CPU scheduling as well (but use nice levels and adjusted by dynamic priorities instead of a straight % share). So you could say you want your database to consume no more than 50% of disk and have your mp3 player get a minimum of 10%. Of course, that doesn't say anything about what the time slices are, or what latencies you can expect (1s out of every 10, or 100ms out of every 1000?). It is still far from perfect, but at least it accounts for seeks vs throughput reasonably well, and in a device independent manner. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:55 ` Nick Piggin @ 2006-10-18 13:04 ` Jens Axboe 2006-10-18 13:39 ` Jakob Oestergaard 2006-10-18 13:51 ` Paulo Marques 2006-10-18 13:37 ` Jakob Oestergaard 1 sibling, 2 replies; 26+ messages in thread From: Jens Axboe @ 2006-10-18 13:04 UTC (permalink / raw) To: Nick Piggin Cc: Alan Cox, Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Nick Piggin wrote: > Jens Axboe wrote: > >On Wed, Oct 18 2006, Alan Cox wrote: > > > >>Bandwidth is completely silly in this context, iops/sec is merely > >>hopeless 8) > > > > > >Both need the disk to play nicely, if you get into error handling or > >correction, you get screwed. Bandwidth by itself is meaningless, you > >need latency as well to make sense of it. > > When writing an IO scheduler, I decided `time' was a pretty good > metric. That's roughly what we use for CPU scheduling as well (but > use nice levels and adjusted by dynamic priorities instead of a > straight % share). Precisely, hence CFQ is now based on the time metric. Given larger slices, you can mostly eliminate the impact of other applications in the system. > So you could say you want your database to consume no more than 50% > of disk and have your mp3 player get a minimum of 10%. Of course, > that doesn't say anything about what the time slices are, or what > latencies you can expect (1s out of every 10, or 100ms out of every > 1000?). As I wrote previously, both a percentage and bandwidth along with desired latency make sense. For the mp3 player, you probably don't care how much of the system it uses. You want 256kbit/sec or whatever your media is, and if you don't get that then things don't work. The other scenario is limiting eg the database. > It is still far from perfect, but at least it accounts for seeks vs > throughput reasonably well, and in a device independent manner. 
We can't aim for perfection, as that is simply not doable generically. So we have to settle for something that makes sense and is enforceable to some extent. We can limit something to foo percentage of the disk, and we can try as hard as possible to satisfy the mp3 player as long as we know what it requires. Right now we don't, so we treat everybody the same wrt slices and latency. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 13:04 ` Jens Axboe @ 2006-10-18 13:39 ` Jakob Oestergaard 2006-10-18 13:51 ` Paulo Marques 1 sibling, 0 replies; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 13:39 UTC (permalink / raw) To: Jens Axboe Cc: Nick Piggin, Alan Cox, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18, 2006 at 03:04:57PM +0200, Jens Axboe wrote: ... > > So you could say you want your database to consume no more than 50% > > of disk and have your mp3 player get a minimum of 10%. Of course, > > that doesn't say anything about what the time slices are, or what > > latencies you can expect (1s out of every 10, or 100ms out of every > > 1000?). > > As I wrote previously, both a percentage and bandwidth along with > desired latency make sense. The fundamental problem I see is that while you can easily measure and limit the bandwidth, you cannot really make bandwidth guarantees. All the rest sounds great in my ears :) -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 13:04 ` Jens Axboe 2006-10-18 13:39 ` Jakob Oestergaard @ 2006-10-18 13:51 ` Paulo Marques 2006-10-19 12:22 ` Jens Axboe 1 sibling, 1 reply; 26+ messages in thread From: Paulo Marques @ 2006-10-18 13:51 UTC (permalink / raw) To: Jens Axboe Cc: Nick Piggin, Alan Cox, Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel Jens Axboe wrote: >[...] > Precisely, hence CFQ is now based on the time metric. Given larger > slices, you can mostly eliminate the impact of other applications in the > system. Just one thought: we can't predict reliably how much time a request will take to be serviced, but we can account the time it _took_ to service a request. If we account the time it took to service requests for each process, and we have several processes with requests pending, we can use the same algorithm we would use for a large time slice algorithm to select the process to service. This should make it as fair over time as a large time slice algorithm and doesn't need large time slices, so latencies can be kept as low as required. However, having a small time slice will probably help the hardware coalesce several request from the same process that are more likely to be to nearby sectors, and thus improve performance. I'm leaving out the details, like we should find a way to make the "fairness" work over a time window and not over the entire process lifespan, maybe by using a sliding window over the last N seconds of serviced requests to do the accounting or something. -- Paulo Marques - www.grupopie.com "The face of a child can say it all, especially the mouth part of the face." ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 13:51 ` Paulo Marques @ 2006-10-19 12:22 ` Jens Axboe 0 siblings, 0 replies; 26+ messages in thread From: Jens Axboe @ 2006-10-19 12:22 UTC (permalink / raw) To: Paulo Marques Cc: Nick Piggin, Alan Cox, Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Paulo Marques wrote: > Jens Axboe wrote: > >[...] > >Precisely, hence CFQ is now based on the time metric. Given larger > >slices, you can mostly eliminate the impact of other applications in the > >system. > > Just one thought: we can't predict reliably how much time a request will > take to be serviced, but we can account the time it _took_ to service a > request. > > If we account the time it took to service requests for each process, and > we have several processes with requests pending, we can use the same > algorithm we would use for a large time slice algorithm to select the > process to service. > > This should make it as fair over time as a large time slice algorithm > and doesn't need large time slices, so latencies can be kept as low as > required. Two problems: - You can't chop things down to single request times. The cost of a request varies greatly depending on what preceded it, hence you need to account batches of requests from a process - this is what the time slice currently accomplishes. - Whether a process has requests pending or not varies a lot. The typical bandwidth problem is due to processes doing sync or dependent io where you only get io in pieces over time. A request-based approach only works over processes that always (or almost always) have work left to do. You absolutely need the time slice or some other waiting mechanism to help those that don't. > However, having a small time slice will probably help the hardware > coalesce several request from the same process that are more likely to > be to nearby sectors, and thus improve performance. 
Either the process is submitting larger amounts of io and you'll get the merging anyways, or it isn't. There's a large difference in time scales here. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:55 ` Nick Piggin 2006-10-18 13:04 ` Jens Axboe @ 2006-10-18 13:37 ` Jakob Oestergaard 1 sibling, 0 replies; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 13:37 UTC (permalink / raw) To: Nick Piggin Cc: Jens Axboe, Alan Cox, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18, 2006 at 10:55:55PM +1000, Nick Piggin wrote: ... > So you could say you want your database to consume no more than 50% > of disk and have your mp3 player get a minimum of 10%. Of course, > that doesn't say anything about what the time slices are, or what > latencies you can expect (1s out of every 10, or 100ms out of every > 1000?). > > It is still far from perfect, but at least it accounts for seeks vs > throughput reasonably well, and in a device independent manner. Yup - it makes sense. It would make very good sense (to me at least) if you can say "give me at least 100msec every 1sec", as was already suggested. That would take care of the latency problem too. -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:42 ` Alan Cox 2006-10-18 12:44 ` Jens Axboe @ 2006-10-18 12:44 ` Jakob Oestergaard 1 sibling, 0 replies; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 12:44 UTC (permalink / raw) To: Alan Cox Cc: Jens Axboe, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18, 2006 at 01:42:24PM +0100, Alan Cox wrote: > On Wed, 2006-10-18 at 14:23 +0200, Jakob Oestergaard wrote: > > iops/sec is what you get from your disks. In real world scenarios. It's > > no more magic than the real world, and no harder to understand than real > > world disks. Although I admit real-world disks can be a bitch at times ;) > > Even iops/sec is very vague and arbitrary. If your disk happens to be > retrying a sector or doing a cleaning pass or any other housekeeping or > vibration damping and so on you'll get very different numbers. True. > > Bandwidth is completely silly in this context, iops/sec is merely > hopeless 8) Thanks Alan - I feel much better now :) -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:23 ` Jakob Oestergaard 2006-10-18 12:42 ` Alan Cox @ 2006-10-18 12:42 ` Jens Axboe 2006-10-18 13:35 ` Jakob Oestergaard 1 sibling, 1 reply; 26+ messages in thread From: Jens Axboe @ 2006-10-18 12:42 UTC (permalink / raw) To: Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Jakob Oestergaard wrote: > On Wed, Oct 18, 2006 at 01:49:14PM +0200, Jens Axboe wrote: > > On Wed, Oct 18 2006, Jakob Oestergaard wrote: > ... > > > I have no idea how much bandwidth my database needs... But I have a > > > rough idea about how many I/O operations it does for a given operation. > > > And if I don't, strace can tell me pretty quick :) > > > > That's crazy. So you want a user of this to strace and write a script > > parsing strace output to tell you possibly how many iops/sec you need? > > Come up with something better then, genius :) > > strace for iops is doable albeit complicated. The concept was already described, bandwidth. > Determining MiB/sec requirement for sufficient db performance is > impossible. But you can say you want to give the db 90% of the disk bandwidth, and at least 50%. The iops/sec metric doesn't help you. It's an entirely different thing from the mp3 player. With the player app, you want to have the bitrate available at the right latency. For a db, I guess you typically want to contain it somehow - make sure it gets at least foo amount of the disk, but don't let it suck everything. > > > So, what I'm arguing is; you will not want to specify a fixed sequential > > > bandwidth for your mp3 player. > > > > > > What you want to do is this: Allocate 5 iops/sec for your mp3 player > > > because either a quick calculation - or - experience has shown that this > > > is enough for it to keep its buffer from depleting at all times. > > > > But that is the only number that makes sense. 
To give some sort of soft > > QOS for bandwidth, you need the file given so the kernel can bring in > > the meta data (to avoid those seeks) and see how the file is laid out. > > Ok I see where you're going. I think it sounds very complicated - for > the user and for the kernel. > > Would you want to limit bandwidth on a per-file or per-process basis? > You're talking files, above, I was thinking about processes (consumers > if you like) the whole time. You need to define your workload for the kernel to know what to do. So for the bandwidth case, you need to tell the kernel against what file you want to allocate that bandwidth. If you go the percentage route, you don't need that. The percentage route doesn't care about sequential or random io, it just gets you foo % of the disk time. If the slice given is large enough, with 10% of the disk time you may have 90% of the total bandwidth if the remaining 90% of the time is spent doing random io. But you still have 10% of the time allocated. > Have you thought about how this would work in the long run, with many > files coming into use? The kernel can't have the meta-data cached for > all files - so the reading-in of metadata would affect the remaining > available disk performance... Just like any other system activity affects the disk bandwidth. That's exactly one of the reasons why you want to operate in terms of time, not requests. > > For the mp3 case, you should not even need to ask the user anything. The > > player app knows exactly how much bandwidth it needs and what kind of > > latency, if can tell from the bitrate of the media. > > Agreed. And this holds true for both base metrics, bandwidth or iops/sec. Right, because they are sides of the same story. The difference is not in the metric, but the meaning it gives to the user. > > What you are arguing > > for is doing trial and error > > Sort-of correct. How would you otherwise do it? 
> > with a magic iops/sec metric that is both > > hard to understand and impossible to quantify. > > iops/sec is what you get from your disks. In real world scenarios. It's > no more magic than the real world, and no harder to understand than real > world disks. Although I admit real-world disks can be a bitch at times ;) Again, iops/sec doesn't make sense unless you say how big the iops is and what your stream of iops look like. That's why I say it's a benchmark metric. > My argument is that it is simpler to understand than bandwidth. And mine is that that is nonsense :-) > Sure, for the streaming file example bandwidth sounds simple. But how > many real-world applications are like that? What about databases? What > about web servers? What about mail servers? What about 99% of the > real-world applications out there that are not streaming audio or video > players? Reserving bandwidth at x kib/sec for an mp3 player and containing a different type of app are two separate things. A decent io scheduler should make sure in general that nobody is totally starved. If you have 5 services running on your machine and you want to make sure that eg the web server gets 50% of the bandwidth, you will want to inform the kernel of that fact. Since you don't know what the throughput of the disk is at any given time (be it Mib/sec or iops/sec, doesn't matter), you can only say 50% at that time. I really don't see how this pertains to bandwidth vs iops/sec. > > > Limiting on iops/sec rather than bandwidth, is simply accepting that > > > bandwidth does not make sense (because you cannot know how much of it > > > you have and therefore you cannot slice up your total capacity), and, > > > realizing that bandwidth in the scenarios where limiting is interesting > > > is in reality bound by seeks rather than sequential on-disk throughput. > > > > I don't understand your arguments, to be honest. 
If you can tell the > > iops/sec rate for a given workload, you can certainly see the bandwidth > > as well. > > My thesis is, that for most applications it is not the bandwidth you > care about. > > If I am not right in this, sure, you have a point then. But hey, how > many of the applications out there are mp3 players? (in other words; > please oh please, prove me wrong, I like it :) We are talking about two separate things here. The mp3 player vs some other app argument is totally separate from iops/sec vs MiB/sec. > > Both iops/sec and bandwidth will vary wildly depending on the > > workload(s) on the disk. > > The total iops/sec "available" from a given disk will not vary a lot, > compared to how the total bandwidth available from a given disk will > vary. That's only true if you scale your iops. And how are you going to give that number? You need to define what an iop is for it to be meaningful. > > > I can only see a problem with specifying iops/sec in the one scenario > > > where you have multiple sequential readers or writers, and you want to > > > distribute bandwidth between them. > > > > If you only have one app doing io, you don't need QOS. > > Precisely! > > In the *one* case where it is actually possible to implement a QOS > system based on bandwidth, you don't need QOS. > > With more than 1 client, you get seeks, and then bandwidth is no longer > a sensible measure. And neither is iops/sec. But things don't deteriorate that quickly, if you can tolerate higher latency, it's quite possible to have most of the potential bandwidth available for > 1 client workloads. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 12:42 ` Jens Axboe @ 2006-10-18 13:35 ` Jakob Oestergaard 0 siblings, 0 replies; 26+ messages in thread From: Jakob Oestergaard @ 2006-10-18 13:35 UTC (permalink / raw) To: Jens Axboe Cc: Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18, 2006 at 02:42:53PM +0200, Jens Axboe wrote: ... > > impossible. > > But you can say you want to give the db 90% of the disk bandwidth, and > at least 50%. The iops/sec metric doesn't help you. I think we're misunderstanding each other... I am trying to say, that me being able to specify "90% of the disk bandwidth" does not help me. Because the DB would probably be happy with just 1% of the 100MiB/sec theoretical bandwidth I could get from sequentially reading the disk - but if it needs to do, say, 160 seeks per second to get those 1% of 100MiB/sec, then that is still more than 96% of the disk time available with a 6ms seek time. So, I believe we need something that takes into account the general performance of the disk - not just the single-user-sequential-read/write bandwidth. And, as I shall soon argue, this is where I do think the iops/sec metric does help - I probably just explained it very poorly to begin with. > > > > Would you want to limit bandwidth on a per-file or per-process basis? > > You're talking files, above, I was thinking about processes (consumers > > if you like) the whole time. > > You need to define your workload for the kernel to know what to do. So > for the bandwidth case, you need to tell the kernel against what file > you want to allocate that bandwidth. If you go the percentage route, you > don't need that. The percentage route doesn't care about sequential or > random io, it just gets you foo % of the disk time. If the slice given > is large enough, with 10% of the disk time you may have 90% of the total > bandwidth if the remaining 90% of the time is spent doing random io. 
But > you still have 10% of the time allocated. I like the time allocation for several reasons: 1) It's presumably simple to implement 2) It will suit both your mp3 player and my database reasonably well 3) It's intuitive to the user - you can understand wall-clock time a lot easier than all the little things that influence whether or not you get a number of bytes written in a number of places on the disk in more or less than the time you had available... I think "reasonably well" is good enough for a kernel that isn't hard-real-time anyway :) ... [snip - good arguments, response will follow] ... > > > with a magic iops/sec metric that is both > > > hard to understand and impossible to quantify. > > > > iops/sec is what you get from your disks. In real world scenarios. It's > > no more magic than the real world, and no harder to understand than real > > world disks. Although I admit real-world disks can be a bitch at times ;) > > Again, iops/sec doesn't make sense unless you say how big the iops is 1 OSIOP (oestergaard standard input/output operation) is hereby defined to be: 1 optional seek plus 1 (read or write) of no more than 256 KiB (*) (*): The size limit should be adjusted every 10 years as disk technology evolves. There you have it :) So, a single 1MiB read on a disk is 4 OSIOPs, for example. > and what your stream of iops look like. That's why I say it's a > benchmark metric. I state that the total OSIOPs/second you can get out of a given disk will not change by much, no matter which disk operations you perform and how you mix them. That was the whole point of using OSIOPs/sec rather than bandwidth to begin with. I know I did not properly define the iop to begin with - my bad, sorry. > > > My argument is that it is simpler to understand than bandwidth. > > And mine is that that is nonsense :-) Still? :) I hope the above clears up some of the misunderstandings. ... ... 
> > The total iops/sec "available" from a given disk will not vary a lot, > > compared to how the total bandwidth available from a given disk will > > vary. > > That's only true if you scale your iops. And how are you going to give > that number? You need to define what an iop is for it to be meaningful. Done :) A basic OSIOP is useful for the application, because it maps very closely to the read/write/seek API that applications are built over. Thus, the application will know very well how many OSIOPs it needs in order to complete a given job. The total number of OSIOPs/sec available in the system, however, will vary depending on the characteristics of the disk subsystem. Just like available cycles/sec vary with the speed of your processor. You are correct in that the total number of OSIOPs/sec will not be strictly constant over time - it will depend *somewhat* on the nature of the operations performed. But it will not change completely - or at least this is what I claim :) ... > > With more than 1 client, you get seeks, and then bandwidth is no longer > > a sensible measure. > > And neither is iops/sec. We agree that neither is "correct". I still claim that one is "not strictly correct but probably close enough to be useful". > But things don't deteriorate that quickly, if > you can tolerate higher latency, it's quite possible to have most of the > potential bandwidth available for > 1 client workloads. True. I do wonder, though, how often that would be practically useful. Seek times are *huge* (milliseconds) compared to almost anything else we work with. -- / jakob ^ permalink raw reply [flat|nested] 26+ messages in thread
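Under the OSIOP definition given above (one optional seek plus a transfer of at most 256 KiB), counting the OSIOPs in a request is simple arithmetic, and the earlier database figure in this subthread (160 seeks/sec at 6 ms each eating 96% of disk time) checks out the same way. A small sketch of both calculations; `osiops` is an invented helper, not an existing API:

```c
#include <assert.h>

#define OSIOP_KIB 256 /* size cap from the OSIOP definition above */

/* Number of OSIOPs a single transfer counts as: one per started
 * 256 KiB chunk, so a 1 MiB read is 4 OSIOPs. */
static long osiops(long kib)
{
    return (kib + OSIOP_KIB - 1) / OSIOP_KIB;
}

int main(void)
{
    assert(osiops(4) == 1);    /* small random read */
    assert(osiops(256) == 1);  /* exactly at the cap */
    assert(osiops(1024) == 4); /* the 1 MiB example above */

    /* The database example: 160 seeks/sec at 6 ms each is 960 ms of
     * seek time per second, i.e. 96% of the disk's time, even though
     * the data moved is ~1% of the sequential bandwidth. */
    assert(160 * 6 == 960);
    return 0;
}
```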
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 8:00 ` Jakob Oestergaard 2006-10-18 9:40 ` Arjan van de Ven @ 2006-10-18 9:51 ` Jens Axboe 2006-10-18 11:00 ` Helge Hafting 1 sibling, 1 reply; 26+ messages in thread From: Jens Axboe @ 2006-10-18 9:51 UTC (permalink / raw) To: Jakob Oestergaard Cc: Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Jakob Oestergaard wrote: > On Tue, Oct 17, 2006 at 03:23:13PM +0200, Jens Axboe wrote: > > On Tue, Oct 17 2006, Arjan van de Ven wrote: > ... > > > Hi, > > > > > > it's a nice idea in theory. However... since IO bandwidth for seeks is > > > about 1% to 3% of that of sequential IO (on disks at least), which > > > bandwidth do you want to allocate? "worst case" you need to use the > > > all-seeks bandwidth, but that's so far away from "best case" that it may > > > well not be relevant in practice. Yet there are real world cases where > > > for a period of time you approach worst case behavior ;( > > > > Bandwidth reservation would have to be confined to special cases, you > > obviously cannot do it "in general" for the reasons Arjan lists above. > > How about allocating I/O operations instead of bandwidth ? > > So, any read is really a seek+read, and we count that as one I/O > operation. Same for writes. > > Since the total "capacity" of the system is typically (in real-world > scenarios) the number of operations (seek+X) rather than the raw > sequential bandwidth anyway, I suppose that I/O operations would be what > you wanted to allocate anyway. > > Anyway, just a thought... While that may make some sense internally, the exported interface would never be workable like that. It needs to be simple, "give me foo kb/sec with max latency bar for this file", with an access pattern or assumed sequential io. Nobody speaks of iops/sec except some silly benchmark programs. I know that you are describing pseudo-iops, but it still doesn't make it more clear. 
Things aren't as simple as that. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 9:51 ` Jens Axboe @ 2006-10-18 11:00 ` Helge Hafting 2006-10-18 11:14 ` Jens Axboe 2006-10-18 11:23 ` Ric Wheeler 0 siblings, 2 replies; 26+ messages in thread From: Helge Hafting @ 2006-10-18 11:00 UTC (permalink / raw) To: Jens Axboe Cc: Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel Jens Axboe wrote: > While that may make some sense internally, the exported interface would > never be workable like that. It needs to be simple, "give me foo kb/sec > with max latency bar for this file", with an access pattern or assumed > sequential io. > > Nobody speaks of iops/sec except some silly benchmark programs. I know > that you are describing pseudo-iops, but it still doesn't make it more > clear. > Things aren't as simple > How about "give me 10% of total io capacity?" People understand this, and the io scheduler can then guarantee this by ensuring that the process gets 1 out of 10 io requests as long as it keeps submitting enough. The admin can then set a reasonable percentage depending on the machine's capacity. Helge Hafting ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 11:00 ` Helge Hafting @ 2006-10-18 11:14 ` Jens Axboe 2006-10-18 11:23 ` Ric Wheeler 1 sibling, 0 replies; 26+ messages in thread From: Jens Axboe @ 2006-10-18 11:14 UTC (permalink / raw) To: Helge Hafting Cc: Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel On Wed, Oct 18 2006, Helge Hafting wrote: > Jens Axboe wrote: > >While that may make some sense internally, the exported interface would > >never be workable like that. It needs to be simple, "give me foo kb/sec > >with max latency bar for this file", with an access pattern or assumed > >sequential io. > > > >Nobody speaks of iops/sec except some silly benchmark programs. I know > >that you are describing pseudo-iops, but it still doesn't make it more > >clear. > >Things aren't as simple > > > How about "give me 10% of total io capacity?" People understand > this, and the io scheduler can then guarantee this by ensuring > that the process gets 1 out of 10 io requests as long as it > keeps submitting enough. The thing about disks is that it's not as easy as giving the process 10% of the io requests issued. Only if the considered bandwidth is random load will that work, but that's not very interesting. You need to say 10% of the disk time, which is something CFQ can very easily be modified to do since it works with time slices already. 10% doesn't mean very much though, you need a timeframe for that to make sense anyways. Give me 100msec every 1000msecs makes more sense. -- Jens Axboe ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Bandwidth Allocations under CFQ I/O Scheduler 2006-10-18 11:00 ` Helge Hafting 2006-10-18 11:14 ` Jens Axboe @ 2006-10-18 11:23 ` Ric Wheeler 1 sibling, 0 replies; 26+ messages in thread From: Ric Wheeler @ 2006-10-18 11:23 UTC (permalink / raw) To: Helge Hafting Cc: Jens Axboe, Jakob Oestergaard, Arjan van de Ven, Phetteplace, Thad (GE Healthcare, consultant), linux-kernel Helge Hafting wrote: > Jens Axboe wrote: > >> While that may make some sense internally, the exported interface would >> never be workable like that. It needs to be simple, "give me foo kb/sec >> with max latency bar for this file", with an access pattern or assumed >> sequential io. >> >> Nobody speaks of iops/sec except some silly benchmark programs. I know >> that you are describing pseudo-iops, but it still doesn't make it more >> clear. >> Things aren't as simple >> > > How about "give me 10% of total io capacity?" People understand > this, and the io scheduler can then guarantee this by ensuring > that the process gets 1 out of 10 io requests as long as it > keeps submitting enough. > > The admin can then set a reasonable percentage depending on > the machine's capacity. > > Helge Hafting The tricky part is that when you mix up workloads, you blow the drive's ability to minimize head seek & rotational latency. For example, I have measured almost a 10x decrease when I mix one serious workload (reading each file in a large file system as fast as you can) with a moderate write workload. All a long winded way of saying that what we might be able to do in the worst case is to give an even portion of that worst case IO capability which is itself only 10% of the best case (i.e., 1% of the non-shared best case) ;-) Some of the high ends arrays (like the EMC Symmetrix, IBM Shark, Hitachi boxes, etc) are much better at this sharing since they have massive amounts of nonvolatile DRAM & lots of algorithmic ability to tease apart individual streams internally. 
Note that they have to do this since they are connected up to many different hosts. It might be interesting to thinking about how we would tweak things for this specific class of arrays as a special case, ric ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2006-10-19 12:21 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-10-16 20:46 Bandwidth Allocations under CFQ I/O Scheduler Phetteplace, Thad (GE Healthcare, consultant) 2006-10-17 1:24 ` Arjan van de Ven 2006-10-17 13:23 ` Jens Axboe 2006-10-17 14:37 ` Ric Wheeler 2006-10-17 14:47 ` Jens Axboe 2006-10-17 14:46 ` Phetteplace, Thad (GE Healthcare, consultant) 2006-10-18 8:00 ` Jakob Oestergaard 2006-10-18 9:40 ` Arjan van de Ven 2006-10-18 11:30 ` Jakob Oestergaard 2006-10-18 11:49 ` Jens Axboe 2006-10-18 12:23 ` Jakob Oestergaard 2006-10-18 12:42 ` Alan Cox 2006-10-18 12:44 ` Jens Axboe 2006-10-18 12:55 ` Nick Piggin 2006-10-18 13:04 ` Jens Axboe 2006-10-18 13:39 ` Jakob Oestergaard 2006-10-18 13:51 ` Paulo Marques 2006-10-19 12:22 ` Jens Axboe 2006-10-18 13:37 ` Jakob Oestergaard 2006-10-18 12:44 ` Jakob Oestergaard 2006-10-18 12:42 ` Jens Axboe 2006-10-18 13:35 ` Jakob Oestergaard 2006-10-18 9:51 ` Jens Axboe 2006-10-18 11:00 ` Helge Hafting 2006-10-18 11:14 ` Jens Axboe 2006-10-18 11:23 ` Ric Wheeler
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox