From: Jens Axboe
Date: Tue, 28 Sep 2010 07:04:40 +0900
To: Vivek Goyal
CC: Jan Kara, LKML, "jmoyer@redhat.com", Lennart Poettering
Subject: Re: Request starvation with CFQ
Message-ID: <4CA114F8.8000102@fusionio.com>
References: <20100927190024.GF3610@quack.suse.cz> <20100927200232.GA2377@redhat.com>
In-Reply-To: <20100927200232.GA2377@redhat.com>

On 2010-09-28 05:02, Vivek Goyal wrote:
> On Mon, Sep 27, 2010 at 09:00:24PM +0200, Jan Kara wrote:
>> Hi,
>>
>> when helping Lennart with answering some questions, I've spotted the
>> following problem (at least I think it's a problem ;): CFQ schedules
>> when requests get dispatched, but it does not in any significant way
>> limit to whom requests get allocated. Since the pool of available
>> requests is quite limited, processes can end up starved waiting not
>> for the disk but for a request to be allocated, and any IO scheduling
>> priorities or classes then have no real effect.
>> A pathological example I've tried is below:
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>> 	int fd = open("/dev/vdb", O_RDONLY);
>> 	int loop = 0;
>>
>> 	if (fd < 0) {
>> 		perror("open");
>> 		exit(1);
>> 	}
>> 	while (1) {
>> 		if (loop % 100 == 0)
>> 			printf("Loop %d\n", loop);
>> 		posix_fadvise(fd, (random() * 4096) % 1000204886016ULL,
>> 			      4096, POSIX_FADV_WILLNEED);
>> 		loop++;
>> 	}
>> }
>>
>> This program just pushes as many requests as possible to the block
>> layer and never waits for any IO, so it basically ignores any
>> decisions about when requests get dispatched. BTW, don't get
>> distracted by the fact that the program operates directly on the
>> device; that is just for simplicity. A large enough file would work
>> the same way.
>> Even though I run this program with ionice -c 3, I still see that any
>> other IO to the device is basically stalled. When I look at the block
>> traces, I indeed see that the above program submits requests until
>> there are none left available:
>> ...
>> 254,16 2 802 1.411285520 2563 Q R 696733184 + 8 [random_read]
>> 254,16 2 803 1.411314880 2563 G R 696733184 + 8 [random_read]
>> 254,16 2 804 1.411338220 2563 I R 696733184 + 8 [random_read]
>> 254,16 2 805 1.411415040 2563 Q R 1006864600 + 8 [random_read]
>> 254,16 2 806 1.411441620 2563 S R 1006864600 + 8 [random_read]
>>
>> during and after that IO happens:
>> 254,16 3 31 1.417898030 0 C R 345134640 + 8 [0]
>> 254,16 3 32 1.418171910 0 D R 1524771568 + 8 [swapper]
>> 254,16 0 33 1.432317140 0 C R 1524771568 + 8 [0]
>> 254,16 0 34 1.432597000 0 D R 1077270768 + 8 [swapper]
>> ...
>> 254,16 0 35 1.503238050 0 C R 33633744 + 8 [0]
>> 254,16 0 36 1.503558290 0 D R 22178968 + 8 [swapper]
>>
>> and the other program comes with IO and gets stalled:
>> 254,16 1 39 1.508843180 2564 A RM 12346 + 8 <- (254,17) 12312
>> 254,16 1 40 1.508876520 2564 Q RM 12346 + 8 [ls]
>> 254,16 1 41 1.508905140 2564 S RM 12346 + 8 [ls]
>> ...
>> IO is still running:
>> 254,16 2 807 1.512081560 0 C R 22178968 + 8 [0]
>> 254,16 2 808 1.512365010 0 D R 475025688 + 8 [swapper]
>> 254,16 3 35 1.522113270 0 C R 475025688 + 8 [0]
>> 254,16 3 36 1.522390779 0 D R 697010128 + 8 [swapper]
>> 254,16 4 33 1.531443760 0 C R 697010128 + 8 [0]
>> ...
>> the random reader even gets to submit more requests:
>> 254,16 2 815 1.785734950 2563 G R 1006864600 + 8 [random_read]
>> 254,16 2 816 1.785752290 2563 I R 1006864600 + 8 [random_read]
>> 254,16 2 817 1.785825880 2563 Q R 832683552 + 8 [random_read]
>> 254,16 2 818 1.785850890 2563 G R 832683552 + 8 [random_read]
>> 254,16 2 819 1.785874610 2563 I R 832683552 + 8 [random_read]
>> ...
>> and finally our program gets to add its request as well:
>> 254,16 1 60 2.160884040 2564 G RM 12346 + 8 [ls]
>> 254,16 1 61 2.160914700 2564 I R 12346 + 8 [ls]
>> 254,16 1 62 2.161142170 2564 D R 12346 + 8 [ls]
>> 254,16 1 63 2.161233670 2564 U N [ls] 128
>>
>> I can provide the full traces for download if someone is interested
>> in a part I didn't include here. The kernel is 2.6.36-rc4.
>> Now I agree that the above program is about as bad as it can get, but
>> Lennart would like to implement readahead running in the background
>> during boot, and I believe that could starve other IO in a similar
>> way. So any idea how to solve this? To me it seems we would also need
>> to somehow limit the number of allocated requests per cfqq, but OTOH
>> we have to be really careful not to harm common workloads that
>> benefit from having lots of requests queued...
>
> Hi Jan,
>
> True that during request allocation, there is no consideration for
> ioprio. I think the whole logic is round robin, where after getting a
> bunch of requests each process is put to sleep in the queue and then
> we do round robin on all waiters. This should in general be an issue
> with the request queue and not just CFQ.
>
> So if there is a bunch of threads which are very bullish on doing IO,
> and there is a dependent reader, read latencies will shoot up.
>
> In fact the current implementation of the blkio controller also
> suffers from this limitation, because we don't yet have per-group
> request descriptors and once the request queue is congested, requests
> from one group can get stuck behind the requests from another group.
>
> One way forward could be to implement per-cgroup request descriptors
> and put this readahead thread into a separate cgroup of low weight.
>
> Another could be to implement some kind of request quota per priority
> level. This is similar to the per-cgroup quota I talked about above,
> just one level below.
>
> A third could be an ad-hoc way of putting some limit per cfqq. But I
> think a process can easily circumvent that by forking off children
> which do not share the cfq context, and then we are back to the same
> situation.
>
> A very hackish solution could be to try to increase nr_requests on the
> queue to, say, 1024. This will work only if you know that the
> read-ahead process does a limited amount of read-ahead and does not
> overwhelm the queue with more than 1024 requests. And then use ioprio
> with a low priority for the read-ahead process.

I don't think that is necessarily hackish. The current rq allocation
batching and accounting is pretty horrible, imho; in fact, in recent
patches I ripped that out. The VM copes a lot better with larger queue
depths these days, so what I want to add instead is just a per-ioc
queue limit.

--
Jens Axboe
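For reference, a minimal sketch of the nr_requests + ionice workaround
discussed above. It is an illustration, not code from the thread: the
device name (vdb, from Jan's test program) and the 1024 value are just
the examples used above, the sysfs write needs root, and ioprio_set()
is invoked via syscall(2) since glibc ships no wrapper. The IOPRIO_*
constants mirror the kernel's linux/ioprio.h.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_WHO_PROCESS	1

int main(void)
{
	/* enlarge the request pool, i.e. echo 1024 > nr_requests */
	FILE *f = fopen("/sys/block/vdb/queue/nr_requests", "w");

	if (!f) {
		perror("nr_requests");
		exit(1);
	}
	fprintf(f, "1024\n");
	fclose(f);

	/*
	 * Move ourselves to the idle IO class, which is what
	 * ionice -c 3 does; who == 0 means the calling process.
	 */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		    IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT) < 0) {
		perror("ioprio_set");
		exit(1);
	}

	/* ... kick off the background readahead from here ... */
	return 0;
}

The same can of course be done from a shell with
"echo 1024 > /sys/block/vdb/queue/nr_requests" and by starting the
readahead job under "ionice -c 3", as Jan already does in his test.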