public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads
@ 2011-06-23 16:21 Konstantin Khlebnikov
  2011-06-23 17:26 ` Vivek Goyal
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Konstantin Khlebnikov @ 2011-06-23 16:21 UTC (permalink / raw)
  To: Jens Axboe, linux-kernel, Vivek Goyal

commit v2.6.32-102-g8682e1f "blkio: Provide some isolation between groups" break
fast switching between task and journal-thread for very common write-fsync workload.
cfq wait idle slice at each cfqq switch, if this task is from non-root blkio cgroup.

This patch move idling sync-noidle preempting check little bit upwards and update
new service_tree->count check for case with two different groups.
I do not quite understand what means these check for new_cfqq, but now it even works.

Without patch I got 49 iops and with this patch 798, for this trivial fio script:

[write-fsync]
cgroup=test
cgroup_weight=1000
rw=write
fsync=1
size=100m
runtime=10s

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
---
 block/cfq-iosched.c |   14 +++++++-------
 1 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3c7b537..c71533e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3318,19 +3318,19 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
 		return true;
 
-	if (new_cfqq->cfqg != cfqq->cfqg)
-		return false;
-
-	if (cfq_slice_used(cfqq))
-		return true;
-
 	/* Allow preemption only if we are idling on sync-noidle tree */
 	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
 	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
-	    new_cfqq->service_tree->count == 2 &&
+	    new_cfqq->service_tree->count == 1+(new_cfqq->cfqg == cfqq->cfqg) &&
 	    RB_EMPTY_ROOT(&cfqq->sort_list))
 		return true;
 
+	if (new_cfqq->cfqg != cfqq->cfqg)
+		return false;
+
+	if (cfq_slice_used(cfqq))
+		return true;
+
 	/*
 	 * So both queues are sync. Let the new request get disk time if
 	 * it's a metadata request and the current queue is doing regular IO.


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads
  2011-06-23 16:21 [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads Konstantin Khlebnikov
@ 2011-06-23 17:26 ` Vivek Goyal
  2011-06-23 18:38   ` Jeff Moyer
  2011-06-23 18:08 ` Vivek Goyal
  2011-06-23 20:34 ` Vivek Goyal
  2 siblings, 1 reply; 6+ messages in thread
From: Vivek Goyal @ 2011-06-23 17:26 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Jens Axboe, linux-kernel, Moyer Jeff Moyer

On Thu, Jun 23, 2011 at 08:21:59PM +0400, Konstantin Khlebnikov wrote:
> commit v2.6.32-102-g8682e1f "blkio: Provide some isolation between groups" break
> fast switching between task and journal-thread for very common write-fsync workload.
> cfq wait idle slice at each cfqq switch, if this task is from non-root blkio cgroup.
> 
> This patch move idling sync-noidle preempting check little bit upwards and update
> new service_tree->count check for case with two different groups.
> I do not quite understand what means these check for new_cfqq, but now it even works.
> 
> Without patch I got 49 iops and with this patch 798, for this trivial fio script:
> 
> [write-fsync]
> cgroup=test
> cgroup_weight=1000
> rw=write
> fsync=1
> size=100m
> runtime=10s
> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> ---
>  block/cfq-iosched.c |   14 +++++++-------
>  1 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 3c7b537..c71533e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3318,19 +3318,19 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>  	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>  		return true;
>  
> -	if (new_cfqq->cfqg != cfqq->cfqg)
> -		return false;
> -
> -	if (cfq_slice_used(cfqq))
> -		return true;
> -
>  	/* Allow preemption only if we are idling on sync-noidle tree */
>  	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
>  	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
> -	    new_cfqq->service_tree->count == 2 &&
> +	    new_cfqq->service_tree->count == 1+(new_cfqq->cfqg == cfqq->cfqg) &&

I think this will completely break the isolation between groups. So now
your a cgroup might be serving a cfqq from SYNC_NOIDLE_WORKLOAD and a
SYNC_NOIDLE_WORKLOAD queue gets queued in a separate group, it will
immediately preempt queue in other group.

This problem is arising due to dependency between fsync and journalling
threads which are in different cgroups.

We had similar problem in root group also when one thread will show up
on sync-idle tree and other thread will show up on sync-noidle tree
and idling will kill throughput. I can't seem to remember how did we
fix that. I know Jeff moyer was working on some slice yield patches
but that never made in.  This patch also will not help if we run into
this situation when a dependent thread is on a sync-idle tree.

I guess we need a way to know the dependency between IO operations
and if some dependent IO operation is waiting in other group and
existing group has no IO to do, we can afford not to idle.

Jeff, would you remember how did we fix the fsync issue?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads
  2011-06-23 16:21 [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads Konstantin Khlebnikov
  2011-06-23 17:26 ` Vivek Goyal
@ 2011-06-23 18:08 ` Vivek Goyal
  2011-06-23 20:34 ` Vivek Goyal
  2 siblings, 0 replies; 6+ messages in thread
From: Vivek Goyal @ 2011-06-23 18:08 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Jens Axboe, linux-kernel

On Thu, Jun 23, 2011 at 08:21:59PM +0400, Konstantin Khlebnikov wrote:
> commit v2.6.32-102-g8682e1f "blkio: Provide some isolation between groups" break
> fast switching between task and journal-thread for very common write-fsync workload.
> cfq wait idle slice at each cfqq switch, if this task is from non-root blkio cgroup.
> 
> This patch move idling sync-noidle preempting check little bit upwards and update
> new service_tree->count check for case with two different groups.
> I do not quite understand what means these check for new_cfqq, but now it even works.
> 
> Without patch I got 49 iops and with this patch 798, for this trivial fio script:

Which filesystem you are using for these tests?

Vivek

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads
  2011-06-23 17:26 ` Vivek Goyal
@ 2011-06-23 18:38   ` Jeff Moyer
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff Moyer @ 2011-06-23 18:38 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Konstantin Khlebnikov, Jens Axboe, linux-kernel

Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Jun 23, 2011 at 08:21:59PM +0400, Konstantin Khlebnikov wrote:
>> commit v2.6.32-102-g8682e1f "blkio: Provide some isolation between groups" break
>> fast switching between task and journal-thread for very common write-fsync workload.
>> cfq wait idle slice at each cfqq switch, if this task is from non-root blkio cgroup.
>> 
>> This patch move idling sync-noidle preempting check little bit upwards and update
>> new service_tree->count check for case with two different groups.
>> I do not quite understand what means these check for new_cfqq, but now it even works.
>> 
>> Without patch I got 49 iops and with this patch 798, for this trivial fio script:
>> 
>> [write-fsync]
>> cgroup=test
>> cgroup_weight=1000
>> rw=write
>> fsync=1
>> size=100m
>> runtime=10s
>> 
>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
>> ---
>>  block/cfq-iosched.c |   14 +++++++-------
>>  1 files changed, 7 insertions(+), 7 deletions(-)
>> 
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 3c7b537..c71533e 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -3318,19 +3318,19 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>>  	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>>  		return true;
>>  
>> -	if (new_cfqq->cfqg != cfqq->cfqg)
>> -		return false;
>> -
>> -	if (cfq_slice_used(cfqq))
>> -		return true;
>> -
>>  	/* Allow preemption only if we are idling on sync-noidle tree */
>>  	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
>>  	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
>> -	    new_cfqq->service_tree->count == 2 &&
>> +	    new_cfqq->service_tree->count == 1+(new_cfqq->cfqg == cfqq->cfqg) &&
>
> I think this will completely break the isolation between groups. So now
> your a cgroup might be serving a cfqq from SYNC_NOIDLE_WORKLOAD and a
> SYNC_NOIDLE_WORKLOAD queue gets queued in a separate group, it will
> immediately preempt queue in other group.
>
> This problem is arising due to dependency between fsync and journalling
> threads which are in different cgroups.
>
> We had similar problem in root group also when one thread will show up
> on sync-idle tree and other thread will show up on sync-noidle tree
> and idling will kill throughput. I can't seem to remember how did we
> fix that. I know Jeff moyer was working on some slice yield patches
> but that never made in.  This patch also will not help if we run into
> this situation when a dependent thread is on a sync-idle tree.
>
> I guess we need a way to know the dependency between IO operations
> and if some dependent IO operation is waiting in other group and
> existing group has no IO to do, we can afford not to idle.
>
> Jeff, would you remember how did we fix the fsync issue?

In the absence of cgroups, it was this patch:

commit 749ef9f8423054e326f3a246327ed2db4b6d395f
Author: Corrado Zoccolo <czoccolo@gmail.com>
Date:   Mon Sep 20 15:24:50 2010 +0200

    cfq: improve fsync performance for small files
    
    Fsync performance for small files achieved by cfq on high-end disks is
    lower than what deadline can achieve, due to idling introduced between
    the sync write happening in process context and the journal commit.
    
    Moreover, when competing with a sequential reader, a process writing
    small files and fsync-ing them is starved.
    
    This patch fixes the two problems by:
    - marking journal commits as WRITE_SYNC, so that they get the REQ_NOIDLE
      flag set,
    - force all queues that have REQ_NOIDLE requests to be put in the noidle
      tree.
    
    Having the queue associated to the fsync-ing process and the one associated
     to journal commits in the noidle tree allows:
    - switching between them without idling,
    - fairness vs. competing idling queues, since they will be serviced only
      after the noidle tree expires its slice.
    
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
    Tested-by: Jeff Moyer <jmoyer@redhat.com>
    Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

If you need a mechanism to explicitly yield the I/O scheduler, I had
written several iterations of patches to provide that functionality.  I
believe this was the latest variant:
  https://lkml.org/lkml/2010/7/2/277

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads
  2011-06-23 16:21 [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads Konstantin Khlebnikov
  2011-06-23 17:26 ` Vivek Goyal
  2011-06-23 18:08 ` Vivek Goyal
@ 2011-06-23 20:34 ` Vivek Goyal
  2011-06-24 10:29   ` Konstantin Khlebnikov
  2 siblings, 1 reply; 6+ messages in thread
From: Vivek Goyal @ 2011-06-23 20:34 UTC (permalink / raw)
  To: Konstantin Khlebnikov; +Cc: Jens Axboe, linux-kernel

On Thu, Jun 23, 2011 at 08:21:59PM +0400, Konstantin Khlebnikov wrote:
> commit v2.6.32-102-g8682e1f "blkio: Provide some isolation between groups" break
> fast switching between task and journal-thread for very common write-fsync workload.
> cfq wait idle slice at each cfqq switch, if this task is from non-root blkio cgroup.
> 
> This patch move idling sync-noidle preempting check little bit upwards and update
> new service_tree->count check for case with two different groups.
> I do not quite understand what means these check for new_cfqq, but now it even works.
> 
> Without patch I got 49 iops and with this patch 798, for this trivial fio script:
> 
> [write-fsync]
> cgroup=test
> cgroup_weight=1000
> rw=write
> fsync=1
> size=100m
> runtime=10s

What kind of storage and filesystem you are using? I tried this on a SATA
disk and I really don't get good throughput. With deadline scheduler I
get aggrb=103KB/s.

I think with fsync we are generating so many FLUSH requests that it 
really slows down fsync.

Even if I use CFQ with and without cgroups, I get following.

CFQ, without cgroup
------------------
aggrb=100KB/s

CFQ with cgroup
--------------
aggrb=94KB/s

So with FLUSH requests, not much difference in throughput for this 
workload.

I guess you must be running with barriers off or something like that.

Thanks
Vivek


> 
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
> ---
>  block/cfq-iosched.c |   14 +++++++-------
>  1 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 3c7b537..c71533e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3318,19 +3318,19 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>  	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
>  		return true;
>  
> -	if (new_cfqq->cfqg != cfqq->cfqg)
> -		return false;
> -
> -	if (cfq_slice_used(cfqq))
> -		return true;
> -
>  	/* Allow preemption only if we are idling on sync-noidle tree */
>  	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD &&
>  	    cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD &&
> -	    new_cfqq->service_tree->count == 2 &&
> +	    new_cfqq->service_tree->count == 1+(new_cfqq->cfqg == cfqq->cfqg) &&
>  	    RB_EMPTY_ROOT(&cfqq->sort_list))
>  		return true;
>  
> +	if (new_cfqq->cfqg != cfqq->cfqg)
> +		return false;
> +
> +	if (cfq_slice_used(cfqq))
> +		return true;
> +
>  	/*
>  	 * So both queues are sync. Let the new request get disk time if
>  	 * it's a metadata request and the current queue is doing regular IO.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads
  2011-06-23 20:34 ` Vivek Goyal
@ 2011-06-24 10:29   ` Konstantin Khlebnikov
  0 siblings, 0 replies; 6+ messages in thread
From: Konstantin Khlebnikov @ 2011-06-24 10:29 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Jens Axboe, linux-kernel@vger.kernel.org

Vivek Goyal wrote:
> On Thu, Jun 23, 2011 at 08:21:59PM +0400, Konstantin Khlebnikov wrote:
>> commit v2.6.32-102-g8682e1f "blkio: Provide some isolation between groups" break
>> fast switching between task and journal-thread for very common write-fsync workload.
>> cfq wait idle slice at each cfqq switch, if this task is from non-root blkio cgroup.
>>
>> This patch move idling sync-noidle preempting check little bit upwards and update
>> new service_tree->count check for case with two different groups.
>> I do not quite understand what means these check for new_cfqq, but now it even works.
>>
>> Without patch I got 49 iops and with this patch 798, for this trivial fio script:
>>
>> [write-fsync]
>> cgroup=test
>> cgroup_weight=1000
>> rw=write
>> fsync=1
>> size=100m
>> runtime=10s
>
> What kind of storage and filesystem you are using? I tried this on a SATA
> disk and I really don't get good throughput. With deadline scheduler I
> get aggrb=103KB/s.
>
> I think with fsync we are generating so many FLUSH requests that it
> really slows down fsync.
>
> Even if I use CFQ with and without cgroups, I get following.
>
> CFQ, without cgroup
> ------------------
> aggrb=100KB/s
>
> CFQ with cgroup
> --------------
> aggrb=94KB/s
>
> So with FLUSH requests, not much difference in throughput for this
> workload.
>
> I guess you must be running with barriers off or something like that.

Yes, it was ext4 on sata hdd without barriers, seems like ssd are not affected,
at least my intel x25m-g2. But I have problem report at openvz bugzilla where
this bug appears even with barriers on some cool server hardware:
http://bugzilla.openvz.org/show_bug.cgi?id=1913

>
> Thanks
> Vivek
>
>
>>
>> Signed-off-by: Konstantin Khlebnikov<khlebnikov@openvz.org>
>> ---
>>   block/cfq-iosched.c |   14 +++++++-------
>>   1 files changed, 7 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 3c7b537..c71533e 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -3318,19 +3318,19 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>>   	if (rq_is_sync(rq)&&  !cfq_cfqq_sync(cfqq))
>>   		return true;
>>
>> -	if (new_cfqq->cfqg != cfqq->cfqg)
>> -		return false;
>> -
>> -	if (cfq_slice_used(cfqq))
>> -		return true;
>> -
>>   	/* Allow preemption only if we are idling on sync-noidle tree */
>>   	if (cfqd->serving_type == SYNC_NOIDLE_WORKLOAD&&
>>   	cfqq_type(new_cfqq) == SYNC_NOIDLE_WORKLOAD&&
>> -	    new_cfqq->service_tree->count == 2&&
>> +	    new_cfqq->service_tree->count == 1+(new_cfqq->cfqg == cfqq->cfqg)&&
>>   	RB_EMPTY_ROOT(&cfqq->sort_list))
>>   		return true;
>>
>> +	if (new_cfqq->cfqg != cfqq->cfqg)
>> +		return false;
>> +
>> +	if (cfq_slice_used(cfqq))
>> +		return true;
>> +
>>   	/*
>>   	 * So both queues are sync. Let the new request get disk time if
>>   	 * it's a metadata request and the current queue is doing regular IO.


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-06-24 10:29 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-06-23 16:21 [PATCH] cfq-iosched: allow groups preemption for sync-noidle workloads Konstantin Khlebnikov
2011-06-23 17:26 ` Vivek Goyal
2011-06-23 18:38   ` Jeff Moyer
2011-06-23 18:08 ` Vivek Goyal
2011-06-23 20:34 ` Vivek Goyal
2011-06-24 10:29   ` Konstantin Khlebnikov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox