public inbox for linux-kernel@vger.kernel.org
* FIO: kjournald blocked for more than 120 seconds
@ 2008-06-16  2:04 Lin Ming
  2008-06-16 19:29 ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Lin Ming @ 2008-06-16  2:04 UTC (permalink / raw)
  To: jens.axboe; +Cc: Zhang, Yanmin, Linux Kernel Mailing List

Hi, Jens

When running the FIO benchmark, kjournald is blocked for more than 120
seconds. A detailed root-cause analysis and proposed solutions are below.

Any comments are appreciated.

Hardware Environment
---------------------
13 SEAGATE ST373307FC disks in a JBOD, connected by a QLogic ISP2312
Fibre Channel HBA.

Bug description
----------------
fio does vsync random 4K reads on 13 disks, 4 processes per disk, with
the fio global parameters below:
[global]
direct=0
ioengine=vsync
iodepth=256
iodepth_batch=32
size=2G
bs=4k
numjobs=2
loops=5
runtime=1200
group_reporting

Tested 4 I/O schedulers; the issue is only seen with CFQ.
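For reference, the active scheduler can be checked and switched per disk
at runtime through sysfs (the device name here is illustrative; on a
2.6.25-era kernel the available schedulers are noop, anticipatory,
deadline and cfq):

```shell
# Show the available schedulers; the bracketed one is active,
# e.g. "noop anticipatory deadline [cfq]".
cat /sys/block/sdb/queue/scheduler

# Switch one disk's scheduler to reproduce or rule out the hang.
echo cfq > /sys/block/sdb/queue/scheduler
```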

INFO: task kjournald:20558 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
kjournald     D ffff810010820978  6712 20558      2
ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
Call Trace:
[<ffffffff803ba6f2>] kobject_get+0x12/0x17
[<ffffffff80247537>] getnstimeofday+0x2f/0x83
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d195>] io_schedule+0x5d/0x9f
[<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
[<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
[<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
[<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
[<ffffffff80243909>] wake_bit_function+0x0/0x23
[<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
[<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
[<ffffffff8023a676>] lock_timer_base+0x26/0x4b
[<ffffffff8030300a>] kjournald+0xc1/0x1fb
[<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
[<ffffffff80302f49>] kjournald+0x0/0x1fb
[<ffffffff802437bb>] kthread+0x47/0x74
[<ffffffff8022de51>] schedule_tail+0x28/0x5d
[<ffffffff8020cac8>] child_rip+0xa/0x12
[<ffffffff80243774>] kthread+0x0/0x74
[<ffffffff8020cabe>] child_rip+0x0/0x12

Other people have reported a similar issue on LKML:
http://lkml.org/lkml/2008/3/26/494

I did lots of kernel instrumentation and testing, and located the root
cause.

Root cause analysis
-------------------

kjournald is blocked at:
journal_write_commit_record ->
sync_dirty_buffer(bh) ->
wait_on_buffer(bh)

The buffer is submitted but the write never completes.

git bisect points to the patch below:

commit cc19747977824ece6aa1c56a29e974fef5ec2b32
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Fri Apr 20 20:45:39 2007 +0200

    cfq-iosched: tighten queue request overlap condition
    
    For tagged devices, allow overlap of requests if the idle window
    isn't enabled on the current active queue.
    
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a8237be..e859b49 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -989,7 +989,8 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
	 * flight or is idling for a new request, allow either of these
	 * conditions to happen (or time out) before selecting a new queue.
	 */
-	if (cfqq->dispatched || timer_pending(&cfqd->idle_slice_timer)) {
+	if (timer_pending(&cfqd->idle_slice_timer) ||
+	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
		cfqq = NULL;
		goto keep_queue;
	}

The disks of my testing machine are tagged devices, so the CFQ idle
window is disabled. In other words, the active queue of tagged devices
(cfqd->hw_tag=1) never idles waiting for a new request.

This causes the active queue to be expired immediately once it is
empty, even though its time slice has not run out. CFQ then selects the
next queue as the active queue. In this test case, there are thousands
of FIO read requests in the sync queues and only a few write requests,
issued by journal_write_commit_record, in the async queues.

On the other hand, all processes use the default I/O class and
priority. They share the async queue for the same device but each has
its own sync queue, so per device there are 4 sync queues but only 1
async queue.

So a sync queue has a much better chance of being selected as the new
active queue than the async queue.

Sync queues do not idle and they are dispatched all the time. This
leads to many unfinished requests in the device's queue;
namely, cfqd->sync_flight > 0.

static int cfq_dispatch_requests(...) {
	...
	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
		...
		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
			break;
		...
		__cfq_dispatch_requests(cfqq);
	}
	...
}

When cfq_select_queue selects the async queue that contains kjournald's
write request, that queue is never dispatched because
cfqd->sync_flight > 0, so kjournald blocks.

Three proposed solutions
------------------------
1. Do not check cfqd->sync_flight

-               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
-                       break;

2. If we do need to check cfqd->sync_flight, then for tagged devices,
we should give the async queue a few more chances to be dispatched.

@@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
				break;
		}

-		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !cfqd->hw_tag)
			break;

3. Force the write request issued by journal_write_commit_record to be
a sync request. In fact, most write requests submitted by kjournald
appear to be async; we need to convert them to sync requests.

Cheers,
Lin Ming

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-16  2:04 FIO: kjournald blocked for more than 120 seconds Lin Ming
@ 2008-06-16 19:29 ` Jens Axboe
  2008-06-17  1:40   ` Zhang, Yanmin
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2008-06-16 19:29 UTC (permalink / raw)
  To: Lin Ming; +Cc: Zhang, Yanmin, Linux Kernel Mailing List

On Mon, Jun 16 2008, Lin Ming wrote:
> Hi, Jens
> 
> When runnig FIO benchmark, kjournald blocked for more than 120 seconds.
> Detailed root cause analysis and proposed solutions as below.
> 
> Any comment is appreciated.
> 
> Hardware Environment
> ---------------------
> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> Fibe Channel HBA.
> 
> Bug description
> ----------------
> fio vsync random read 4K in 13 disks, 4 processes per disk, fio global
> paramter as below,
> [global]
> direct=0
> ioengine=vsync
> iodepth=256
> iodepth_batch=32
> size=2G
> bs=4k
> numjobs=2
> loops=5
> runtime=1200
> group_reporting
> 
> Tested 4 IO schedulers, issue is only seen in CFQ.
> 
> INFO: task kjournald:20558 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> kjournald     D ffff810010820978  6712 20558      2
> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> Call Trace:
> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> [<ffffffff80247537>] getnstimeofday+0x2f/0x83
> [<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
> [<ffffffff8066d195>] io_schedule+0x5d/0x9f
> [<ffffffff8029c1e7>] sync_buffer+0x3b/0x3f
> [<ffffffff8066d3f0>] __wait_on_bit+0x40/0x6f
> [<ffffffff8029c1ac>] sync_buffer+0x0/0x3f
> [<ffffffff8066d48b>] out_of_line_wait_on_bit+0x6c/0x78
> [<ffffffff80243909>] wake_bit_function+0x0/0x23
> [<ffffffff8029e3ad>] sync_dirty_buffer+0x98/0xcb
> [<ffffffff8030056b>] journal_commit_transaction+0x97d/0xcb6
> [<ffffffff8023a676>] lock_timer_base+0x26/0x4b
> [<ffffffff8030300a>] kjournald+0xc1/0x1fb
> [<ffffffff802438db>] autoremove_wake_function+0x0/0x2e
> [<ffffffff80302f49>] kjournald+0x0/0x1fb
> [<ffffffff802437bb>] kthread+0x47/0x74
> [<ffffffff8022de51>] schedule_tail+0x28/0x5d
> [<ffffffff8020cac8>] child_rip+0xa/0x12
> [<ffffffff80243774>] kthread+0x0/0x74
> [<ffffffff8020cabe>] child_rip+0x0/0x12
> 
> Other people reported the similar issue on LKML. 
> http://lkml.org/lkml/2008/3/26/494 
> 
> I did lots of kernel instrumentation and testing, and located the root
> cause.
> 
> Root cause analysis
> -------------------
> 
> kjournald is blocked at: 
> journal_write_commit_record -> 
> sync_dirty_buffer(bh) -> 
> wait_on_buffer(bh)
> 
> Buffer is submitted but it never finished.
> 
> git bisect to below patch,
> 
> commit cc19747977824ece6aa1c56a29e974fef5ec2b32
> Author: Jens Axboe <jens.axboe@oracle.com>
> Date:   Fri Apr 20 20:45:39 2007 +0200
> 
>     cfq-iosched: tighten queue request overlap condition
>     
>     For tagged devices, allow overlap of requests if the idle window
>     isn't enabled on the current active queue.
>     
>     Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index a8237be..e859b49 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -989,7 +989,8 @@ static struct cfq_queue *cfq_select_queue(struct
> cfq_data *cfqd)
> * flight or is idling for a new request, allow either of these
> * conditions to happen (or time out) before selecting a new queue.
> */
> - if (cfqq->dispatched || timer_pending(&cfqd->idle_slice_timer)) {
> + if (timer_pending(&cfqd->idle_slice_timer) ||
> +     (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
> cfqq = NULL;
> goto keep_queue;
> }
> 
> The disks of my testing machine are tagged devices, so the CFQ idle
> window is disabled. In other words, the active queue of tagged
> devices(cfqd->hw_tag=1) never idle for a new request.
> 
> This causes active queue be expired immediately if it's empty, although
> it has not run out of time. CFQ will select next queue as active queue.
> In this testcase, there are thousands of FIO read requests in sync
> queues, only a few write requests by journal_write_commit_record in
> async queues.
> 
> In the other hand, all processes use the default io class and priority.
> They share the async queue for the same device, but have their own sync
> queue, so the sync queue number is 4 while asyn queue number is just 1
> for the same device.
> 
> So sync queue has much more chances be selected as new active queue than
> async queue.
> 
> Sync queues do not idle and they are dispatched all the time. This leads
> to many unfinished requests in external queue, 
> namely, cfqd->sync_flight > 0.
> 
> static int cfq_dispatch_requests (...) {
> 	....
> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> 	....
> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> 		break;
> 		....
> 		__cfq_dispatch_requests(cfqq)
> 	}
> 	....
> }
> 
> When cfq_select_queue selects the async queue which includes kjournald's
> write request, this selected async queue will never be dispatched since
> cfqd->sync_flight > 0, so kjournald is blocked.
> 
> Proposed 3 solutions
> ------------------
> 1. Do not check cfqd->sync_flight
> 
> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> -                       break;
> 
> 2. If we do need to check cfqd->sync_flight, then for tagged devices, we
> should give a little more chances to async queue to be dispatched.
> 
> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> request_queue *q, int force)
>                                 break;
>                 }
> 
> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> cfqd->hw_tag)
>                         break;
> 
> 3. Force write request issued by journal_write_commit_record as sync
> request. As a matter of fact, it looks like most write requests
> submitted by kjournald is async request. We need convert them to sync
> requests.

Thanks for the very detailed analysis of the problem, complete with
suggestions. While I think that any code that does:

        submit async io
        wait for it

should be issuing sync IO (or, better, automatically upgrading the
request from async -> sync), we cannot rely on that.

This problem is similar in nature to device starvation, and a classic
solution to that problem is to issue occasional ordered tags to prevent
indefinite starvation. Perhaps we can apply some similar logic here.

For 2.6.26, the simple approach of just removing the sync_flight check
is probably the safest.

-- 
Jens Axboe



* RE: FIO: kjournald blocked for more than 120 seconds
  2008-06-16 19:29 ` Jens Axboe
@ 2008-06-17  1:40   ` Zhang, Yanmin
  2008-06-17  8:36     ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Yanmin @ 2008-06-17  1:40 UTC (permalink / raw)
  To: Jens Axboe, Lin, Ming M; +Cc: Linux Kernel Mailing List

>>-----Original Message-----
>>From: Jens Axboe [mailto:jens.axboe@oracle.com]
>>Sent: Tuesday, June 17, 2008 3:30 AM
>>To: Lin, Ming M
>>Cc: Zhang, Yanmin; Linux Kernel Mailing List
>>Subject: Re: FIO: kjournald blocked for more than 120 seconds
>>
>>On Mon, Jun 16 2008, Lin Ming wrote:
>>> Hi, Jens
>>>
>>> When running the FIO benchmark, kjournald is blocked for more than
>>> 120 seconds. A detailed root-cause analysis and proposed solutions
>>> are below.
>>>
>>> Any comments are appreciated.
>>>
>>> Hardware Environment
>>> ---------------------
>>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a QLogic ISP2312
>>> Fibre Channel HBA.
>>>
>>> Bug description
>>> ----------------
>>> fio does vsync random 4K reads on 13 disks, 4 processes per disk,
>>> with the fio global parameters below.
>>> Tested 4 I/O schedulers; the issue is only seen with CFQ.
>>>
>>> INFO: task kjournald:20558 blocked for more than 120 seconds.
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>> kjournald     D ffff810010820978  6712 20558      2
>>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
>>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
>>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
>>> Call Trace:
>>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
>>> The disks of my testing machine are tagged devices, so the CFQ idle
>>> window is disabled. In other words, the active queue of tagged
>>> devices (cfqd->hw_tag=1) never idles waiting for a new request.
>>>
>>> This causes the active queue to be expired immediately once it is
>>> empty, even though its time slice has not run out. CFQ then selects
>>> the next queue as the active queue. In this test case, there are
>>> thousands of FIO read requests in the sync queues and only a few
>>> write requests, issued by journal_write_commit_record, in the async
>>> queues.
>>>
>>> On the other hand, all processes use the default I/O class and
>>> priority. They share the async queue for the same device but each
>>> has its own sync queue, so per device there are 4 sync queues but
>>> only 1 async queue.
>>>
>>> So a sync queue has a much better chance of being selected as the
>>> new active queue than the async queue.
>>>
>>> Sync queues do not idle and they are dispatched all the time. This
>>> leads to many unfinished requests in the device's queue;
>>> namely, cfqd->sync_flight > 0.
>>>
>>> static int cfq_dispatch_requests(...) {
>>> 	...
>>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
>>> 		...
>>> 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
>>> 			break;
>>> 		...
>>> 		__cfq_dispatch_requests(cfqq);
>>> 	}
>>> 	...
>>> }
>>>
>>> When cfq_select_queue selects the async queue that contains
>>> kjournald's write request, that queue is never dispatched because
>>> cfqd->sync_flight > 0, so kjournald blocks.
>>>
>>> Three proposed solutions
>>> ------------------------
>>> 1. Do not check cfqd->sync_flight
>>>
>>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
>>> -                       break;
>>>
>>> 2. If we do need to check cfqd->sync_flight, then for tagged
>>> devices, we should give the async queue a few more chances to be
>>> dispatched.
>>>
>>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
>>>                                 break;
>>>                 }
>>>
>>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
>>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !cfqd->hw_tag)
>>>                         break;
>>>
>>> 3. Force the write request issued by journal_write_commit_record to
>>> be a sync request. In fact, most write requests submitted by
>>> kjournald appear to be async; we need to convert them to sync
>>> requests.
>>
>>Thanks for the very detailed analysis of the problem, complete with
>>suggestions. While I think that any code that does:
>>
>>        submit async io
>>        wait for it
>>
>>should be issuing sync IO (or, better, automatically upgrading the
>>request from async -> sync), we cannot rely on that.
[YM] We can handle this case by case. At the least, we could convert
some important async I/O code paths to sync I/O. For example, kjournald
calls sync_dirty_buffer, which is what we captured in this case.

Another case is writeback. Processes doing mmapped I/O might stop in a
page fault to wait for writeback to finish, or a buffered write might
trigger dirty-page balancing. As the latest kernel is more aggressive
about starting writeback, this might be an issue now.

>>
>>This problem is similar in nature to device starvation, and a classic
>>solution to that problem is to issue occasional ordered tags to
>>prevent indefinite starvation. Perhaps we can apply some similar
>>logic here.
>>
>>For 2.6.26, the simple approach of just removing the sync_flight check
>>is probably the safest.



* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-17  1:40   ` Zhang, Yanmin
@ 2008-06-17  8:36     ` Jens Axboe
  2008-06-17  9:02       ` Lin Ming
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2008-06-17  8:36 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Lin, Ming M, Linux Kernel Mailing List

On Tue, Jun 17 2008, Zhang, Yanmin wrote:
> >>-----Original Message-----
> >>From: Jens Axboe [mailto:jens.axboe@oracle.com]
> >>Sent: Tuesday, June 17, 2008 3:30 AM
> >>To: Lin, Ming M
> >>Cc: Zhang, Yanmin; Linux Kernel Mailing List
> >>Subject: Re: FIO: kjournald blocked for more than 120 seconds
> >>
> >>On Mon, Jun 16 2008, Lin Ming wrote:
> >>> Hi, Jens
> >>>
> >>> When runnig FIO benchmark, kjournald blocked for more than 120
> seconds.
> >>> Detailed root cause analysis and proposed solutions as below.
> >>>
> >>> Any comment is appreciated.
> >>>
> >>> Hardware Environment
> >>> ---------------------
> >>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> >>> Fibe Channel HBA.
> >>>
> >>> Bug description
> >>> ----------------
> >>> fio vsync random read 4K in 13 disks, 4 processes per disk, fio
> global
> >>> paramter as below,
> >>> Tested 4 IO schedulers, issue is only seen in CFQ.
> >>>
> >>> INFO: task kjournald:20558 blocked for more than 120 seconds.
> >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> >>> message.
> >>> kjournald     D ffff810010820978  6712 20558      2
> >>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> >>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> >>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> >>> Call Trace:
> >>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> >>> The disks of my testing machine are tagged devices, so the CFQ idle
> >>> window is disabled. In other words, the active queue of tagged
> >>> devices(cfqd->hw_tag=1) never idle for a new request.
> >>>
> >>> This causes active queue be expired immediately if it's empty,
> although
> >>> it has not run out of time. CFQ will select next queue as active
> queue.
> >>> In this testcase, there are thousands of FIO read requests in sync
> >>> queues, only a few write requests by journal_write_commit_record in
> >>> async queues.
> >>>
> >>> In the other hand, all processes use the default io class and
> priority.
> >>> They share the async queue for the same device, but have their own
> sync
> >>> queue, so the sync queue number is 4 while asyn queue number is just
> 1
> >>> for the same device.
> >>>
> >>> So sync queue has much more chances be selected as new active queue
> than
> >>> async queue.
> >>>
> >>> Sync queues do not idle and they are dispatched all the time. This
> leads
> >>> to many unfinished requests in external queue,
> >>> namely, cfqd->sync_flight > 0.
> >>>
> >>> static int cfq_dispatch_requests (...) {
> >>> 	....
> >>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> >>> 	....
> >>> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> >>> 		break;
> >>> 		....
> >>> 		__cfq_dispatch_requests(cfqq)
> >>> 	}
> >>> 	....
> >>> }
> >>>
> >>> When cfq_select_queue selects the async queue which includes
> kjournald's
> >>> write request, this selected async queue will never be dispatched
> since
> >>> cfqd->sync_flight > 0, so kjournald is blocked.
> >>>
> >>> Proposed 3 solutions
> >>> ------------------
> >>> 1. Do not check cfqd->sync_flight
> >>>
> >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> >>> -                       break;
> >>>
> >>> 2. If we do need to check cfqd->sync_flight, then for tagged
> devices, we
> >>> should give a little more chances to async queue to be dispatched.
> >>>
> >>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> >>> request_queue *q, int force)
> >>>                                 break;
> >>>                 }
> >>>
> >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> >>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> >>> cfqd->hw_tag)
> >>>                         break;
> >>>
> >>> 3. Force write request issued by journal_write_commit_record as sync
> >>> request. As a matter of fact, it looks like most write requests
> >>> submitted by kjournald is async request. We need convert them to
> sync
> >>> requests.
> >>
> >>Thanks for the very detailed analysis of the problem, complete with
> >>suggestions. While I think that any code that does:
> >>
> >>        submit async io
> >>        wait for it
> >>
> >>should be issuing sync IO (or, better, automatically upgrade the
> request
> >>from async -> sync), we cannot rely on that.
> [YM] We can talk case by case. We could convert some important async io
> codes
>  to sync io codes at least. For example, kjournald calls
> sync_dirty_buffer what 
> we captured in this case.

I agree, we should fix the obvious cases. My point was merely that there
will probably always be missed cases, so we should attempt to handle it
in the scheduler as well. Does the below buffer patch make it any
better?

> Another case is writeback. If processes do mmapped I/O and they might
> stop in 
> page fault to wait writeback finishing. Or a buffer write might trigger
> a dirty 
> page balance. As the latest kernel is more aggressive to start
> writeback, it might 
> be an issue now.

Sync process getting stuck in async writeout is another problem of the
same variety.

diff --git a/fs/buffer.c b/fs/buffer.c
index a073f3f..1957a8f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2978,7 +2978,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
 	if (test_clear_buffer_dirty(bh)) {
 		get_bh(bh);
 		bh->b_end_io = end_buffer_write_sync;
-		ret = submit_bh(WRITE, bh);
+		ret = submit_bh(WRITE_SYNC, bh);
 		wait_on_buffer(bh);
 		if (buffer_eopnotsupp(bh)) {
 			clear_buffer_eopnotsupp(bh);

-- 
Jens Axboe



* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-17  8:36     ` Jens Axboe
@ 2008-06-17  9:02       ` Lin Ming
  2008-06-17  9:31         ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Lin Ming @ 2008-06-17  9:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Zhang, Yanmin, Linux Kernel Mailing List


On Tue, 2008-06-17 at 10:36 +0200, Jens Axboe wrote:
> On Tue, Jun 17 2008, Zhang, Yanmin wrote:
> > >>-----Original Message-----
> > >>From: Jens Axboe [mailto:jens.axboe@oracle.com]
> > >>Sent: Tuesday, June 17, 2008 3:30 AM
> > >>To: Lin, Ming M
> > >>Cc: Zhang, Yanmin; Linux Kernel Mailing List
> > >>Subject: Re: FIO: kjournald blocked for more than 120 seconds
> > >>
> > >>On Mon, Jun 16 2008, Lin Ming wrote:
> > >>> Hi, Jens
> > >>>
> > >>> When runnig FIO benchmark, kjournald blocked for more than 120
> > seconds.
> > >>> Detailed root cause analysis and proposed solutions as below.
> > >>>
> > >>> Any comment is appreciated.
> > >>>
> > >>> Hardware Environment
> > >>> ---------------------
> > >>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> > >>> Fibe Channel HBA.
> > >>>
> > >>> Bug description
> > >>> ----------------
> > >>> fio vsync random read 4K in 13 disks, 4 processes per disk, fio
> > global
> > >>> paramter as below,
> > >>> Tested 4 IO schedulers, issue is only seen in CFQ.
> > >>>
> > >>> INFO: task kjournald:20558 blocked for more than 120 seconds.
> > >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > >>> message.
> > >>> kjournald     D ffff810010820978  6712 20558      2
> > >>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> > >>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> > >>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> > >>> Call Trace:
> > >>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> > >>> The disks of my testing machine are tagged devices, so the CFQ idle
> > >>> window is disabled. In other words, the active queue of tagged
> > >>> devices(cfqd->hw_tag=1) never idle for a new request.
> > >>>
> > >>> This causes active queue be expired immediately if it's empty,
> > although
> > >>> it has not run out of time. CFQ will select next queue as active
> > queue.
> > >>> In this testcase, there are thousands of FIO read requests in sync
> > >>> queues, only a few write requests by journal_write_commit_record in
> > >>> async queues.
> > >>>
> > >>> In the other hand, all processes use the default io class and
> > priority.
> > >>> They share the async queue for the same device, but have their own
> > sync
> > >>> queue, so the sync queue number is 4 while asyn queue number is just
> > 1
> > >>> for the same device.
> > >>>
> > >>> So sync queue has much more chances be selected as new active queue
> > than
> > >>> async queue.
> > >>>
> > >>> Sync queues do not idle and they are dispatched all the time. This
> > leads
> > >>> to many unfinished requests in external queue,
> > >>> namely, cfqd->sync_flight > 0.
> > >>>
> > >>> static int cfq_dispatch_requests (...) {
> > >>> 	....
> > >>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> > >>> 	....
> > >>> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > >>> 		break;
> > >>> 		....
> > >>> 		__cfq_dispatch_requests(cfqq)
> > >>> 	}
> > >>> 	....
> > >>> }
> > >>>
> > >>> When cfq_select_queue selects the async queue which includes
> > kjournald's
> > >>> write request, this selected async queue will never be dispatched
> > since
> > >>> cfqd->sync_flight > 0, so kjournald is blocked.
> > >>>
> > >>> Proposed 3 solutions
> > >>> ------------------
> > >>> 1. Do not check cfqd->sync_flight
> > >>>
> > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > >>> -                       break;
> > >>>
> > >>> 2. If we do need to check cfqd->sync_flight, then for tagged
> > devices, we
> > >>> should give a little more chances to async queue to be dispatched.
> > >>>
> > >>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> > >>> request_queue *q, int force)
> > >>>                                 break;
> > >>>                 }
> > >>>
> > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > >>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> > >>> cfqd->hw_tag)
> > >>>                         break;
> > >>>
> > >>> 3. Force write request issued by journal_write_commit_record as sync
> > >>> request. As a matter of fact, it looks like most write requests
> > >>> submitted by kjournald is async request. We need convert them to
> > sync
> > >>> requests.
> > >>
> > >>Thanks for the very detailed analysis of the problem, complete with
> > >>suggestions. While I think that any code that does:
> > >>
> > >>        submit async io
> > >>        wait for it
> > >>
> > >>should be issuing sync IO (or, better, automatically upgrade the
> > request
> > >>from async -> sync), we cannot rely on that.
> > [YM] We can talk case by case. We could convert some important async io
> > codes
> >  to sync io codes at least. For example, kjournald calls
> > sync_dirty_buffer what 
> > we captured in this case.
> 
> I agree, we should fix the obvious cases. My point was merely that there
> will probably always be missed cases, so we should attempt to handle it
> in the scheduler as well. Does the below buffer patch make it any
> better?

Yes, the kjournald blocking issue is gone with the patch below applied.

Lin Ming

> 
> > Another case is writeback. If processes do mmapped I/O and they might
> > stop in 
> > page fault to wait writeback finishing. Or a buffer write might trigger
> > a dirty 
> > page balance. As the latest kernel is more aggressive to start
> > writeback, it might 
> > be an issue now.
> 
> Sync process getting stuck in async writeout is another problem of the
> same variety.
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a073f3f..1957a8f 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2978,7 +2978,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
>  	if (test_clear_buffer_dirty(bh)) {
>  		get_bh(bh);
>  		bh->b_end_io = end_buffer_write_sync;
> -		ret = submit_bh(WRITE, bh);
> +		ret = submit_bh(WRITE_SYNC, bh);
>  		wait_on_buffer(bh);
>  		if (buffer_eopnotsupp(bh)) {
>  			clear_buffer_eopnotsupp(bh);
> 



* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-17  9:02       ` Lin Ming
@ 2008-06-17  9:31         ` Jens Axboe
  2008-06-26  9:33           ` Lin Ming
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2008-06-17  9:31 UTC (permalink / raw)
  To: Lin Ming; +Cc: Zhang, Yanmin, Linux Kernel Mailing List

On Tue, Jun 17 2008, Lin Ming wrote:
> 
> On Tue, 2008-06-17 at 10:36 +0200, Jens Axboe wrote:
> > On Tue, Jun 17 2008, Zhang, Yanmin wrote:
> > > >>-----Original Message-----
> > > >>From: Jens Axboe [mailto:jens.axboe@oracle.com]
> > > >>Sent: Tuesday, June 17, 2008 3:30 AM
> > > >>To: Lin, Ming M
> > > >>Cc: Zhang, Yanmin; Linux Kernel Mailing List
> > > >>Subject: Re: FIO: kjournald blocked for more than 120 seconds
> > > >>
> > > >>On Mon, Jun 16 2008, Lin Ming wrote:
> > > >>> Hi, Jens
> > > >>>
> > > >>> When runnig FIO benchmark, kjournald blocked for more than 120
> > > seconds.
> > > >>> Detailed root cause analysis and proposed solutions as below.
> > > >>>
> > > >>> Any comment is appreciated.
> > > >>>
> > > >>> Hardware Environment
> > > >>> ---------------------
> > > >>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> > > >>> Fibe Channel HBA.
> > > >>>
> > > >>> Bug description
> > > >>> ----------------
> > > >>> fio vsync random read 4K in 13 disks, 4 processes per disk, fio
> > > global
> > > >>> paramter as below,
> > > >>> Tested 4 IO schedulers, issue is only seen in CFQ.
> > > >>>
> > > >>> INFO: task kjournald:20558 blocked for more than 120 seconds.
> > > >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > >>> message.
> > > >>> kjournald     D ffff810010820978  6712 20558      2
> > > >>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> > > >>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> > > >>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> > > >>> Call Trace:
> > > >>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> > > >>> The disks of my testing machine are tagged devices, so the CFQ idle
> > > >>> window is disabled. In other words, the active queue of tagged
> > > >>> devices(cfqd->hw_tag=1) never idle for a new request.
> > > >>>
> > > >>> This causes the active queue to be expired immediately if it's empty,
> > > although
> > > >>> it has not run out of time. CFQ will select next queue as active
> > > queue.
> > > >>> In this testcase, there are thousands of FIO read requests in sync
> > > >>> queues, only a few write requests by journal_write_commit_record in
> > > >>> async queues.
> > > >>>
> > > >>> On the other hand, all processes use the default io class and
> > > priority.
> > > >>> They share the async queue for the same device, but have their own
> > > sync
> > > >>> queue, so the sync queue number is 4 while the async queue number is just
> > > 1
> > > >>> for the same device.
> > > >>>
> > > >>> So a sync queue has a much better chance of being selected as the new
> > > >>> active queue than an async queue.
> > > >>>
> > > >>> Sync queues do not idle and they are dispatched all the time. This
> > > leads
> > > >>> to many unfinished requests in external queue,
> > > >>> namely, cfqd->sync_flight > 0.
> > > >>>
> > > >>> static int cfq_dispatch_requests (...) {
> > > >>> 	....
> > > >>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> > > >>> 	....
> > > >>> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > >>> 		break;
> > > >>> 		....
> > > >>> 		__cfq_dispatch_requests(cfqq)
> > > >>> 	}
> > > >>> 	....
> > > >>> }
> > > >>>
> > > >>> When cfq_select_queue selects the async queue which includes
> > > kjournald's
> > > >>> write request, this selected async queue will never be dispatched
> > > since
> > > >>> cfqd->sync_flight > 0, so kjournald is blocked.
> > > >>>
> > > >>> Proposed 3 solutions
> > > >>> ------------------
> > > >>> 1. Do not check cfqd->sync_flight
> > > >>>
> > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > >>> -                       break;
> > > >>>
> > > >>> 2. If we do need to check cfqd->sync_flight, then for tagged
> > > devices, we
> > > >>> should give a little more chances to async queue to be dispatched.
> > > >>>
> > > >>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> > > >>> request_queue *q, int force)
> > > >>>                                 break;
> > > >>>                 }
> > > >>>
> > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > >>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> > > >>> cfqd->hw_tag)
> > > >>>                         break;
> > > >>>
> > > >>> 3. Force write request issued by journal_write_commit_record as sync
> > > >>> request. As a matter of fact, it looks like most write requests
> > > >>> submitted by kjournald are async requests. We need to convert them to
> > > >>> sync requests.
> > > >>
> > > >>Thanks for the very detailed analysis of the problem, complete with
> > > >>suggestions. While I think that any code that does:
> > > >>
> > > >>        submit async io
> > > >>        wait for it
> > > >>
> > > >>should be issuing sync IO (or, better, automatically upgrade the
> > > request
> > > >>from async -> sync), we cannot rely on that.
> > > [YM] We can talk case by case. We could convert some important async io
> > > codes
> > > to sync io codes at least. For example, kjournald calls
> > > sync_dirty_buffer, which is what we captured in this case.
> > 
> > I agree, we should fix the obvious cases. My point was merely that there
> > will probably always be missed cases, so we should attempt to handle it
> > in the scheduler as well. Does the below buffer patch make it any
> > better?
> 
> Yes, kjournald blocked issue is gone with below patch applied.

I think it's obviously the right thing to do, but I'm also a bit worried
about applying it so close to 2.6.26 release. OTOH, we need to do
SOMETHING for 2.6.26 release, so...

> 
> Lin Ming
> 
> > 
> > > Another case is writeback. If processes do mmapped I/O, they might
> > > stop in a page fault to wait for writeback to finish. Or a buffer write
> > > might trigger a dirty page balance. As the latest kernel is more
> > > aggressive about starting writeback, it might be an issue now.
> > 
> > Sync process getting stuck in async writeout is another problem of the
> > same variety.
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index a073f3f..1957a8f 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2978,7 +2978,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
> >  	if (test_clear_buffer_dirty(bh)) {
> >  		get_bh(bh);
> >  		bh->b_end_io = end_buffer_write_sync;
> > -		ret = submit_bh(WRITE, bh);
> > +		ret = submit_bh(WRITE_SYNC, bh);
> >  		wait_on_buffer(bh);
> >  		if (buffer_eopnotsupp(bh)) {
> >  			clear_buffer_eopnotsupp(bh);
> > 
> 

-- 
Jens Axboe



* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-17  9:31         ` Jens Axboe
@ 2008-06-26  9:33           ` Lin Ming
  2008-06-26  9:49             ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Lin Ming @ 2008-06-26  9:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Zhang, Yanmin, Linux Kernel Mailing List

Jens,

How about below patch?

If cfqd->sync_flight > 0, we still allow the async queue to be dispatched,
but its slice time is decreased, so sync queues never starve the async
queue.

Introduce a sysfs interface, cfq_slice_async_penalty, to adjust the
percentage of the slice that will be decreased.

Signed-off-by: Lin Ming <ming.m.lin@intel.com>
---
 block/cfq-iosched.c |   33 ++++++++++++++++++++++++++++++++-
 1 files changed, 32 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index d01b411..8bdc931 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -26,6 +26,12 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+/*
+ * penalty of dispatching async queue
+ * if there are unfinished requests in sync queue
+ * by default, 50% of time slice decreased
+ */
+static int cfq_slice_async_penalty = 50;
 
 /*
  * offset from end of service tree
@@ -114,6 +120,7 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_slice_async_penalty;
 
 	struct list_head cic_list;
 };
@@ -288,6 +295,25 @@ static inline int cfq_slice_used(struct cfq_queue
*cfqq)
 }
 
 /*
+ * Get the slice time to be decreased
+ * when async queue is dispatched and
+ * there are unfinished requests in sync queue
+ */
+static inline int
+cfq_async_penalty(struct cfq_data *cfqd, struct cfq_queue *cfqq)
+{
+	int penalty;
+
+	if (cfq_cfqq_sync(cfqq))
+		return 0;
+
+	penalty = (cfqq->slice_end - jiffies) *
+			cfqd->cfq_slice_async_penalty / 100;
+
+	return penalty;
+}
+
+/*
  * Lifted from AS - choose which of rq1 and rq2 that is best served
now.
  * We choose the request that is closest to the head right now.
Distance
  * behind the head is penalized and only allowed to a certain extent.
@@ -1103,7 +1129,7 @@ static int cfq_dispatch_requests(struct
request_queue *q, int force)
 		}
 
 		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
-			break;
+			cfqq->slice_end -= cfq_async_penalty(cfqd, cfqq);
 
 		cfq_clear_cfqq_must_dispatch(cfqq);
 		cfq_clear_cfqq_wait_request(cfqq);
@@ -2143,6 +2169,7 @@ static void *cfq_init_queue(struct request_queue
*q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_slice_async_penalty = cfq_slice_async_penalty;
 
 	return cfqd;
 }
@@ -2211,6 +2238,7 @@ SHOW_FUNCTION(cfq_slice_idle_show,
cfqd->cfq_slice_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_slice_async_penalty_show,
cfqd->cfq_slice_async_penalty, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2242,6 +2270,8 @@ STORE_FUNCTION(cfq_slice_sync_store,
&cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX,
1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+STORE_FUNCTION(cfq_slice_async_penalty_store,
&cfqd->cfq_slice_async_penalty, 0,
+		100, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2257,6 +2287,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(slice_async_penalty),
 	__ATTR_NULL
 };
 
Lin Ming

On Tue, 2008-06-17 at 11:31 +0200, Jens Axboe wrote:
> On Tue, Jun 17 2008, Lin Ming wrote:
> > 
> > On Tue, 2008-06-17 at 10:36 +0200, Jens Axboe wrote:
> > > On Tue, Jun 17 2008, Zhang, Yanmin wrote:
> > > > >>-----Original Message-----
> > > > >>From: Jens Axboe [mailto:jens.axboe@oracle.com]
> > > > >>Sent: Tuesday, June 17, 2008 3:30 AM
> > > > >>To: Lin, Ming M
> > > > >>Cc: Zhang, Yanmin; Linux Kernel Mailing List
> > > > >>Subject: Re: FIO: kjournald blocked for more than 120 seconds
> > > > >>
> > > > >>On Mon, Jun 16 2008, Lin Ming wrote:
> > > > >>> Hi, Jens
> > > > >>>
> > > > >>> When running the FIO benchmark, kjournald was blocked for more than 120
> > > > seconds.
> > > > >>> Detailed root cause analysis and proposed solutions as below.
> > > > >>>
> > > > >>> Any comment is appreciated.
> > > > >>>
> > > > >>> Hardware Environment
> > > > >>> ---------------------
> > > > >>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> > > > >>> Fibre Channel HBA.
> > > > >>>
> > > > >>> Bug description
> > > > >>> ----------------
> > > > >>> fio vsync random read 4K in 13 disks, 4 processes per disk, fio
> > > > global
> > > > >>> parameters as below,
> > > > >>> Tested 4 IO schedulers, issue is only seen in CFQ.
> > > > >>>
> > > > >>> INFO: task kjournald:20558 blocked for more than 120 seconds.
> > > > >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > >>> message.
> > > > >>> kjournald     D ffff810010820978  6712 20558      2
> > > > >>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> > > > >>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> > > > >>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> > > > >>> Call Trace:
> > > > >>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> > > > >>> The disks of my testing machine are tagged devices, so the CFQ idle
> > > > >>> window is disabled. In other words, the active queue of tagged
> > > > >>> devices(cfqd->hw_tag=1) never idle for a new request.
> > > > >>>
> > > > >>> This causes the active queue to be expired immediately if it's empty,
> > > > although
> > > > >>> it has not run out of time. CFQ will select next queue as active
> > > > queue.
> > > > >>> In this testcase, there are thousands of FIO read requests in sync
> > > > >>> queues, only a few write requests by journal_write_commit_record in
> > > > >>> async queues.
> > > > >>>
> > > > >>> On the other hand, all processes use the default io class and
> > > > priority.
> > > > >>> They share the async queue for the same device, but have their own
> > > > sync
> > > > >>> queue, so the sync queue number is 4 while the async queue number is just
> > > > 1
> > > > >>> for the same device.
> > > > >>>
> > > > >>> So a sync queue has a much better chance of being selected as the new
> > > > >>> active queue than an async queue.
> > > > >>>
> > > > >>> Sync queues do not idle and they are dispatched all the time. This
> > > > leads
> > > > >>> to many unfinished requests in external queue,
> > > > >>> namely, cfqd->sync_flight > 0.
> > > > >>>
> > > > >>> static int cfq_dispatch_requests (...) {
> > > > >>> 	....
> > > > >>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> > > > >>> 	....
> > > > >>> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > >>> 		break;
> > > > >>> 		....
> > > > >>> 		__cfq_dispatch_requests(cfqq)
> > > > >>> 	}
> > > > >>> 	....
> > > > >>> }
> > > > >>>
> > > > >>> When cfq_select_queue selects the async queue which includes
> > > > kjournald's
> > > > >>> write request, this selected async queue will never be dispatched
> > > > since
> > > > >>> cfqd->sync_flight > 0, so kjournald is blocked.
> > > > >>>
> > > > >>> Proposed 3 solutions
> > > > >>> ------------------
> > > > >>> 1. Do not check cfqd->sync_flight
> > > > >>>
> > > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > >>> -                       break;
> > > > >>>
> > > > >>> 2. If we do need to check cfqd->sync_flight, then for tagged
> > > > devices, we
> > > > >>> should give a little more chances to async queue to be dispatched.
> > > > >>>
> > > > >>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> > > > >>> request_queue *q, int force)
> > > > >>>                                 break;
> > > > >>>                 }
> > > > >>>
> > > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > >>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> > > > >>> cfqd->hw_tag)
> > > > >>>                         break;
> > > > >>>
> > > > >>> 3. Force write request issued by journal_write_commit_record as sync
> > > > >>> request. As a matter of fact, it looks like most write requests
> > > > >>> submitted by kjournald are async requests. We need to convert them to
> > > > >>> sync requests.
> > > > >>
> > > > >>Thanks for the very detailed analysis of the problem, complete with
> > > > >>suggestions. While I think that any code that does:
> > > > >>
> > > > >>        submit async io
> > > > >>        wait for it
> > > > >>
> > > > >>should be issuing sync IO (or, better, automatically upgrade the
> > > > request
> > > > >>from async -> sync), we cannot rely on that.
> > > > [YM] We can talk case by case. We could convert some important async io
> > > > codes
> > > > to sync io codes at least. For example, kjournald calls
> > > > sync_dirty_buffer, which is what we captured in this case.
> > > 
> > > I agree, we should fix the obvious cases. My point was merely that there
> > > will probably always be missed cases, so we should attempt to handle it
> > > in the scheduler as well. Does the below buffer patch make it any
> > > better?
> > 
> > Yes, kjournald blocked issue is gone with below patch applied.
> 
> I think it's obviously the right thing to do, but I'm also a bit worried
> about applying it so close to 2.6.26 release. OTOH, we need to do
> SOMETHING for 2.6.26 release, so...
> 
> > 
> > Lin Ming
> > 
> > > 
> > > > Another case is writeback. If processes do mmapped I/O, they might
> > > > stop in a page fault to wait for writeback to finish. Or a buffer write
> > > > might trigger a dirty page balance. As the latest kernel is more
> > > > aggressive about starting writeback, it might be an issue now.
> > > 
> > > Sync process getting stuck in async writeout is another problem of the
> > > same variety.
> > > 
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index a073f3f..1957a8f 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -2978,7 +2978,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
> > >  	if (test_clear_buffer_dirty(bh)) {
> > >  		get_bh(bh);
> > >  		bh->b_end_io = end_buffer_write_sync;
> > > -		ret = submit_bh(WRITE, bh);
> > > +		ret = submit_bh(WRITE_SYNC, bh);
> > >  		wait_on_buffer(bh);
> > >  		if (buffer_eopnotsupp(bh)) {
> > >  			clear_buffer_eopnotsupp(bh);
> > > 
> > 
> 



* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-26  9:33           ` Lin Ming
@ 2008-06-26  9:49             ` Jens Axboe
  2008-06-27  8:09               ` Lin Ming
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2008-06-26  9:49 UTC (permalink / raw)
  To: Lin Ming; +Cc: Zhang, Yanmin, Linux Kernel Mailing List

On Thu, Jun 26 2008, Lin Ming wrote:
> Jens,
> 
> How about below patch?
> 
> If cfqd->sync_flight > 0, we still allow the async queue to be dispatched,
> but its slice time is decreased, so sync queues never starve the async
> queue.
> 
> Introduce a sysfs interface cfq_slice_async_penalty to adjust the
> percentage of the slice that will be decreased.

I don't think it's a good change, for a variety of reasons:

- The reason we don't dispatch async with sync in flight, is to avoid
  interference of the async IO with the sync IO. Perhaps that is OK in
  this case, since elv_dispatch_sort() by default doesn't allow crossing
  read vs write boundaries.

- async queue is typically expired immediately, so the slice length
  doesn't factor into it so much.

- I don't think more tunables are going to help solve anything.

With the current check, we risk starving async IO in the presence of lots of
sync IO. Without the check, we risk starving sync processes in the presence
of lots of async IO. So we need some sort of middle ground. Async IO is
penalized in the presence of lots of sync IO already. As the number of
sync queues grow, async IO gets a continually smaller part of the total
disk time. That isn't very fair to begin with.

For 2.6.26, how about adding the WRITE_SYNC change to buffer.c that we
discussed earlier? I've been using it here since and it's a less risky
change. Then for 2.6.27 we can look into tweaking CFQ.



> Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> ---
>  block/cfq-iosched.c |   33 ++++++++++++++++++++++++++++++++-
>  1 files changed, 32 insertions(+), 1 deletions(-)
> 
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index d01b411..8bdc931 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -26,6 +26,12 @@ static const int cfq_slice_sync = HZ / 10;
>  static int cfq_slice_async = HZ / 25;
>  static const int cfq_slice_async_rq = 2;
>  static int cfq_slice_idle = HZ / 125;
> +/*
> + * penalty of dispatching async queue
> + * if there are unfinished requests in sync queue
> + * by default, 50% of time slice decreased
> + */
> +static int cfq_slice_async_penalty = 50;
>  
>  /*
>   * offset from end of service tree
> @@ -114,6 +120,7 @@ struct cfq_data {
>  	unsigned int cfq_slice[2];
>  	unsigned int cfq_slice_async_rq;
>  	unsigned int cfq_slice_idle;
> +	unsigned int cfq_slice_async_penalty;
>  
>  	struct list_head cic_list;
>  };
> @@ -288,6 +295,25 @@ static inline int cfq_slice_used(struct cfq_queue
> *cfqq)
>  }
>  
>  /*
> + * Get the slice time to be decreased
> + * when async queue is dispatched and
> + * there are unfinished requests in sync queue
> + */
> +static inline int
> +cfq_async_penalty(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> +{
> +	int penalty;
> +
> +	if (cfq_cfqq_sync(cfqq))
> +		return 0;
> +
> +	penalty = (cfqq->slice_end - jiffies) *
> +			cfqd->cfq_slice_async_penalty / 100;
> +
> +	return penalty;
> +}
> +
> +/*
>   * Lifted from AS - choose which of rq1 and rq2 that is best served
> now.
>   * We choose the request that is closest to the head right now.
> Distance
>   * behind the head is penalized and only allowed to a certain extent.
> @@ -1103,7 +1129,7 @@ static int cfq_dispatch_requests(struct
> request_queue *q, int force)
>  		}
>  
>  		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> -			break;
> +			cfqq->slice_end -= cfq_async_penalty(cfqd, cfqq);
>  
>  		cfq_clear_cfqq_must_dispatch(cfqq);
>  		cfq_clear_cfqq_wait_request(cfqq);
> @@ -2143,6 +2169,7 @@ static void *cfq_init_queue(struct request_queue
> *q)
>  	cfqd->cfq_slice[1] = cfq_slice_sync;
>  	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>  	cfqd->cfq_slice_idle = cfq_slice_idle;
> +	cfqd->cfq_slice_async_penalty = cfq_slice_async_penalty;
>  
>  	return cfqd;
>  }
> @@ -2211,6 +2238,7 @@ SHOW_FUNCTION(cfq_slice_idle_show,
> cfqd->cfq_slice_idle, 1);
>  SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
>  SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
>  SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> +SHOW_FUNCTION(cfq_slice_async_penalty_show,
> cfqd->cfq_slice_async_penalty, 0);
>  #undef SHOW_FUNCTION
>  
>  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> @@ -2242,6 +2270,8 @@ STORE_FUNCTION(cfq_slice_sync_store,
> &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
>  STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX,
> 1);
>  STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
>  		UINT_MAX, 0);
> +STORE_FUNCTION(cfq_slice_async_penalty_store,
> &cfqd->cfq_slice_async_penalty, 0,
> +		100, 0);
>  #undef STORE_FUNCTION
>  
>  #define CFQ_ATTR(name) \
> @@ -2257,6 +2287,7 @@ static struct elv_fs_entry cfq_attrs[] = {
>  	CFQ_ATTR(slice_async),
>  	CFQ_ATTR(slice_async_rq),
>  	CFQ_ATTR(slice_idle),
> +	CFQ_ATTR(slice_async_penalty),
>  	__ATTR_NULL
>  };
>  
> Lin Ming
> 
> On Tue, 2008-06-17 at 11:31 +0200, Jens Axboe wrote:
> > On Tue, Jun 17 2008, Lin Ming wrote:
> > > 
> > > On Tue, 2008-06-17 at 10:36 +0200, Jens Axboe wrote:
> > > > On Tue, Jun 17 2008, Zhang, Yanmin wrote:
> > > > > >>-----Original Message-----
> > > > > >>From: Jens Axboe [mailto:jens.axboe@oracle.com]
> > > > > >>Sent: Tuesday, June 17, 2008 3:30 AM
> > > > > >>To: Lin, Ming M
> > > > > >>Cc: Zhang, Yanmin; Linux Kernel Mailing List
> > > > > >>Subject: Re: FIO: kjournald blocked for more than 120 seconds
> > > > > >>
> > > > > >>On Mon, Jun 16 2008, Lin Ming wrote:
> > > > > >>> Hi, Jens
> > > > > >>>
> > > > > >>> When running the FIO benchmark, kjournald was blocked for more than 120
> > > > > seconds.
> > > > > >>> Detailed root cause analysis and proposed solutions as below.
> > > > > >>>
> > > > > >>> Any comment is appreciated.
> > > > > >>>
> > > > > >>> Hardware Environment
> > > > > >>> ---------------------
> > > > > >>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> > > > > >>> Fibre Channel HBA.
> > > > > >>>
> > > > > >>> Bug description
> > > > > >>> ----------------
> > > > > >>> fio vsync random read 4K in 13 disks, 4 processes per disk, fio
> > > > > global
> > > > > >>> parameters as below,
> > > > > >>> Tested 4 IO schedulers, issue is only seen in CFQ.
> > > > > >>>
> > > > > >>> INFO: task kjournald:20558 blocked for more than 120 seconds.
> > > > > >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > > >>> message.
> > > > > >>> kjournald     D ffff810010820978  6712 20558      2
> > > > > >>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> > > > > >>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> > > > > >>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> > > > > >>> Call Trace:
> > > > > >>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> > > > > >>> The disks of my testing machine are tagged devices, so the CFQ idle
> > > > > >>> window is disabled. In other words, the active queue of tagged
> > > > > >>> devices(cfqd->hw_tag=1) never idle for a new request.
> > > > > >>>
> > > > > >>> This causes the active queue to be expired immediately if it's empty,
> > > > > although
> > > > > >>> it has not run out of time. CFQ will select next queue as active
> > > > > queue.
> > > > > >>> In this testcase, there are thousands of FIO read requests in sync
> > > > > >>> queues, only a few write requests by journal_write_commit_record in
> > > > > >>> async queues.
> > > > > >>>
> > > > > >>> On the other hand, all processes use the default io class and
> > > > > priority.
> > > > > >>> They share the async queue for the same device, but have their own
> > > > > sync
> > > > > >>> queue, so the sync queue number is 4 while the async queue number is just
> > > > > 1
> > > > > >>> for the same device.
> > > > > >>>
> > > > > >>> So a sync queue has a much better chance of being selected as the new
> > > > > >>> active queue than an async queue.
> > > > > >>>
> > > > > >>> Sync queues do not idle and they are dispatched all the time. This
> > > > > leads
> > > > > >>> to many unfinished requests in external queue,
> > > > > >>> namely, cfqd->sync_flight > 0.
> > > > > >>>
> > > > > >>> static int cfq_dispatch_requests (...) {
> > > > > >>> 	....
> > > > > >>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> > > > > >>> 	....
> > > > > >>> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > >>> 		break;
> > > > > >>> 		....
> > > > > >>> 		__cfq_dispatch_requests(cfqq)
> > > > > >>> 	}
> > > > > >>> 	....
> > > > > >>> }
> > > > > >>>
> > > > > >>> When cfq_select_queue selects the async queue which includes
> > > > > kjournald's
> > > > > >>> write request, this selected async queue will never be dispatched
> > > > > since
> > > > > >>> cfqd->sync_flight > 0, so kjournald is blocked.
> > > > > >>>
> > > > > >>> Proposed 3 solutions
> > > > > >>> ------------------
> > > > > >>> 1. Do not check cfqd->sync_flight
> > > > > >>>
> > > > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > >>> -                       break;
> > > > > >>>
> > > > > >>> 2. If we do need to check cfqd->sync_flight, then for tagged
> > > > > devices, we
> > > > > >>> should give a little more chances to async queue to be dispatched.
> > > > > >>>
> > > > > >>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> > > > > >>> request_queue *q, int force)
> > > > > >>>                                 break;
> > > > > >>>                 }
> > > > > >>>
> > > > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > >>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> > > > > >>> cfqd->hw_tag)
> > > > > >>>                         break;
> > > > > >>>
> > > > > >>> 3. Force write request issued by journal_write_commit_record as sync
> > > > > >>> request. As a matter of fact, it looks like most write requests
> > > > > >>> submitted by kjournald are async requests. We need to convert them to
> > > > > >>> sync requests.
> > > > > >>
> > > > > >>Thanks for the very detailed analysis of the problem, complete with
> > > > > >>suggestions. While I think that any code that does:
> > > > > >>
> > > > > >>        submit async io
> > > > > >>        wait for it
> > > > > >>
> > > > > >>should be issuing sync IO (or, better, automatically upgrade the
> > > > > request
> > > > > >>from async -> sync), we cannot rely on that.
> > > > > [YM] We can talk case by case. We could convert some important async io
> > > > > codes
> > > > > to sync io codes at least. For example, kjournald calls
> > > > > sync_dirty_buffer, which is what we captured in this case.
> > > > 
> > > > I agree, we should fix the obvious cases. My point was merely that there
> > > > will probably always be missed cases, so we should attempt to handle it
> > > > in the scheduler as well. Does the below buffer patch make it any
> > > > better?
> > > 
> > > Yes, kjournald blocked issue is gone with below patch applied.
> > 
> > I think it's obviously the right thing to do, but I'm also a bit worried
> > about applying it so close to 2.6.26 release. OTOH, we need to do
> > SOMETHING for 2.6.26 release, so...
> > 
> > > 
> > > Lin Ming
> > > 
> > > > 
> > > > > Another case is writeback. If processes do mmapped I/O, they might
> > > > > stop in a page fault to wait for writeback to finish. Or a buffer write
> > > > > might trigger a dirty page balance. As the latest kernel is more
> > > > > aggressive about starting writeback, it might be an issue now.
> > > > 
> > > > Sync process getting stuck in async writeout is another problem of the
> > > > same variety.
> > > > 
> > > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > > index a073f3f..1957a8f 100644
> > > > --- a/fs/buffer.c
> > > > +++ b/fs/buffer.c
> > > > @@ -2978,7 +2978,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
> > > >  	if (test_clear_buffer_dirty(bh)) {
> > > >  		get_bh(bh);
> > > >  		bh->b_end_io = end_buffer_write_sync;
> > > > -		ret = submit_bh(WRITE, bh);
> > > > +		ret = submit_bh(WRITE_SYNC, bh);
> > > >  		wait_on_buffer(bh);
> > > >  		if (buffer_eopnotsupp(bh)) {
> > > >  			clear_buffer_eopnotsupp(bh);
> > > > 
> > > 
> > 
> 

-- 
Jens Axboe



* Re: FIO: kjournald blocked for more than 120 seconds
  2008-06-26  9:49             ` Jens Axboe
@ 2008-06-27  8:09               ` Lin Ming
  0 siblings, 0 replies; 9+ messages in thread
From: Lin Ming @ 2008-06-27  8:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Zhang, Yanmin, Linux Kernel Mailing List


On Thu, 2008-06-26 at 11:49 +0200, Jens Axboe wrote:
> On Thu, Jun 26 2008, Lin Ming wrote:
> > Jens,
> > 
> > How about below patch?
> > 
> > If cfqd->sync_flight > 0, we still allow the async queue to be dispatched,
> > but its slice time is decreased, so sync queues never starve the async
> > queue.
> > 
> > Introduce a sysfs interface cfq_slice_async_penalty to adjust the
> > percentage of the slice that will be decreased.
> 
> I don't think it's a good change, for a variety of reasons:
> 
> - The reason we don't dispatch async with sync in flight, is to avoid
>   interference of the async IO with the sync IO. Perhaps that is OK in
>   this case, since elv_dispatch_sort() by default doesn't allow crossing
>   read vs write boundaries.
> 
> - async queue is typically expired immediately, so the slice length
>   doesn't factor into it so much.
> 
> - I don't think more tunables are going to help solve anything.
> 
> With the current check, we risk starving async IO in the presence of lots of
> sync IO. Without the check, we risk starving sync processes in the presence
> of lots of async IO. So we need some sort of middle ground. Async IO is
> penalized in the presence of lots of sync IO already. As the number of
> sync queues grow, async IO gets a continually smaller part of the total
> disk time. That isn't very fair to begin with.
> 
> For 2.6.26, how about adding the WRITE_SYNC change to buffer.c that we
> discussed earlier? I've been using it here since and it's a less risky
> change. Then for 2.6.27 we can look into tweaking CFQ.

Sounds good. 
OK.

Lin Ming

> 
> 
> 
> > Signed-off-by: Lin Ming <ming.m.lin@intel.com>
> > ---
> >  block/cfq-iosched.c |   33 ++++++++++++++++++++++++++++++++-
> >  1 files changed, 32 insertions(+), 1 deletions(-)
> > 
> > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> > index d01b411..8bdc931 100644
> > --- a/block/cfq-iosched.c
> > +++ b/block/cfq-iosched.c
> > @@ -26,6 +26,12 @@ static const int cfq_slice_sync = HZ / 10;
> >  static int cfq_slice_async = HZ / 25;
> >  static const int cfq_slice_async_rq = 2;
> >  static int cfq_slice_idle = HZ / 125;
> > +/*
> > + * penalty of dispatching async queue
> > + * if there are unfinished requests in sync queue
> > + * by default, 50% of time slice decreased
> > + */
> > +static int cfq_slice_async_penalty = 50;
> >  
> >  /*
> >   * offset from end of service tree
> > @@ -114,6 +120,7 @@ struct cfq_data {
> >  	unsigned int cfq_slice[2];
> >  	unsigned int cfq_slice_async_rq;
> >  	unsigned int cfq_slice_idle;
> > +	unsigned int cfq_slice_async_penalty;
> >  
> >  	struct list_head cic_list;
> >  };
> > @@ -288,6 +295,25 @@ static inline int cfq_slice_used(struct cfq_queue
> > *cfqq)
> >  }
> >  
> >  /*
> > + * Get the slice time to be decreased
> > + * when async queue is dispatched and
> > + * there are unfinished requests in sync queue
> > + */
> > +static inline int
> > +cfq_async_penalty(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > +{
> > +	int penalty;
> > +
> > +	if (cfq_cfqq_sync(cfqq))
> > +		return 0;
> > +
> > +	penalty = (cfqq->slice_end - jiffies) *
> > +			cfqd->cfq_slice_async_penalty / 100;
> > +
> > +	return penalty;
> > +}
> > +
> > +/*
> >   * Lifted from AS - choose which of rq1 and rq2 that is best served
> > now.
> >   * We choose the request that is closest to the head right now.
> > Distance
> >   * behind the head is penalized and only allowed to a certain extent.
> > @@ -1103,7 +1129,7 @@ static int cfq_dispatch_requests(struct
> > request_queue *q, int force)
> >  		}
> >  
> >  		if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > -			break;
> > +			cfqq->slice_end -= cfq_async_penalty(cfqd, cfqq);
> >  
> >  		cfq_clear_cfqq_must_dispatch(cfqq);
> >  		cfq_clear_cfqq_wait_request(cfqq);
> > @@ -2143,6 +2169,7 @@ static void *cfq_init_queue(struct request_queue
> > *q)
> >  	cfqd->cfq_slice[1] = cfq_slice_sync;
> >  	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
> >  	cfqd->cfq_slice_idle = cfq_slice_idle;
> > +	cfqd->cfq_slice_async_penalty = cfq_slice_async_penalty;
> >  
> >  	return cfqd;
> >  }
> > @@ -2211,6 +2238,7 @@ SHOW_FUNCTION(cfq_slice_idle_show,
> > cfqd->cfq_slice_idle, 1);
> >  SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
> >  SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
> >  SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> > +SHOW_FUNCTION(cfq_slice_async_penalty_show,
> > cfqd->cfq_slice_async_penalty, 0);
> >  #undef SHOW_FUNCTION
> >  
> >  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> > @@ -2242,6 +2270,8 @@ STORE_FUNCTION(cfq_slice_sync_store,
> > &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
> >  STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX,
> > 1);
> >  STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
> >  		UINT_MAX, 0);
> > +STORE_FUNCTION(cfq_slice_async_penalty_store,
> > &cfqd->cfq_slice_async_penalty, 0,
> > +		100, 0);
> >  #undef STORE_FUNCTION
> >  
> >  #define CFQ_ATTR(name) \
> > @@ -2257,6 +2287,7 @@ static struct elv_fs_entry cfq_attrs[] = {
> >  	CFQ_ATTR(slice_async),
> >  	CFQ_ATTR(slice_async_rq),
> >  	CFQ_ATTR(slice_idle),
> > +	CFQ_ATTR(slice_async_penalty),
> >  	__ATTR_NULL
> >  };
> >  
> > Lin Ming
> > 
> > On Tue, 2008-06-17 at 11:31 +0200, Jens Axboe wrote:
> > > On Tue, Jun 17 2008, Lin Ming wrote:
> > > > 
> > > > On Tue, 2008-06-17 at 10:36 +0200, Jens Axboe wrote:
> > > > > On Tue, Jun 17 2008, Zhang, Yanmin wrote:
> > > > > > >>-----Original Message-----
> > > > > > >>From: Jens Axboe [mailto:jens.axboe@oracle.com]
> > > > > > >>Sent: Tuesday, June 17, 2008 3:30 AM
> > > > > > >>To: Lin, Ming M
> > > > > > >>Cc: Zhang, Yanmin; Linux Kernel Mailing List
> > > > > > >>Subject: Re: FIO: kjournald blocked for more than 120 seconds
> > > > > > >>
> > > > > > >>On Mon, Jun 16 2008, Lin Ming wrote:
> > > > > > >>> Hi, Jens
> > > > > > >>>
> > > > > > >>> When running the FIO benchmark, kjournald blocked for more than
> > > > > > >>> 120 seconds.
> > > > > > >>> Detailed root cause analysis and proposed solutions as below.
> > > > > > >>>
> > > > > > >>> Any comment is appreciated.
> > > > > > >>>
> > > > > > >>> Hardware Environment
> > > > > > >>> ---------------------
> > > > > > >>> 13 SEAGATE ST373307FC disks in a JBOD, connected by a Qlogic ISP2312
> > > > > > >>> Fibre Channel HBA.
> > > > > > >>>
> > > > > > >>> Bug description
> > > > > > >>> ----------------
> > > > > > >>> fio vsync random read 4K on 13 disks, 4 processes per disk, fio
> > > > > > >>> global parameters as below.
> > > > > > >>> Tested 4 IO schedulers, issue is only seen in CFQ.
> > > > > > >>>
> > > > > > >>> INFO: task kjournald:20558 blocked for more than 120 seconds.
> > > > > > >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > > > >>> message.
> > > > > > >>> kjournald     D ffff810010820978  6712 20558      2
> > > > > > >>> ffff81022ddb1d10 0000000000000046 ffff81022e7baa10 ffffffff803ba6f2
> > > > > > >>> ffff81022ecd0000 ffff8101e6dc9160 ffff81022ecd0348 000000008048b6cb
> > > > > > >>> 0000000000000086 ffff81022c4e8d30 0000000000000000 ffffffff80247537
> > > > > > >>> Call Trace:
> > > > > > >>> [<ffffffff803ba6f2>] kobject_get+0x12/0x17
> > > > > > >>> The disks on my test machine are tagged devices, so the CFQ idle
> > > > > > >>> window is disabled. In other words, the active queue of tagged
> > > > > > >>> devices (cfqd->hw_tag=1) never idles waiting for a new request.
> > > > > > >>>
> > > > > > >>> This causes the active queue to be expired immediately if it is
> > > > > > >>> empty, even though it has not run out of time; CFQ will then
> > > > > > >>> select the next queue as the active queue.
> > > > > > >>> In this testcase there are thousands of FIO read requests in the
> > > > > > >>> sync queues, but only a few write requests, issued by
> > > > > > >>> journal_write_commit_record, in the async queues.
> > > > > > >>>
> > > > > > >>> On the other hand, all processes use the default IO class and
> > > > > > >>> priority. They share the async queue for the same device but each
> > > > > > >>> has its own sync queue, so per device the number of sync queues is
> > > > > > >>> 4 while the number of async queues is just 1.
> > > > > > >>>
> > > > > > >>> So a sync queue has a much better chance of being selected as the
> > > > > > >>> new active queue than the async queue.
> > > > > > >>>
> > > > > > >>> Sync queues do not idle and are dispatched all the time. This
> > > > > > >>> leads to many unfinished requests in the external queue, namely
> > > > > > >>> cfqd->sync_flight > 0.
> > > > > > >>>
> > > > > > >>> static int cfq_dispatch_requests (...) {
> > > > > > >>> 	....
> > > > > > >>> 	while ((cfqq = cfq_select_queue(cfqd)) != NULL) {
> > > > > > >>> 	....
> > > > > > >>> 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > > >>> 		break;
> > > > > > >>> 		....
> > > > > > >>> 		__cfq_dispatch_requests(cfqq)
> > > > > > >>> 	}
> > > > > > >>> 	....
> > > > > > >>> }
> > > > > > >>>
> > > > > > >>> When cfq_select_queue selects the async queue that contains
> > > > > > >>> kjournald's write request, that async queue will never be
> > > > > > >>> dispatched since cfqd->sync_flight > 0, so kjournald is blocked.
> > > > > > >>>
> > > > > > >>> Three proposed solutions
> > > > > > >>> ------------------------
> > > > > > >>> 1. Do not check cfqd->sync_flight
> > > > > > >>>
> > > > > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > > >>> -                       break;
> > > > > > >>>
> > > > > > >>> 2. If we do need to check cfqd->sync_flight, then for tagged
> > > > > > >>> devices we should give the async queue a few more chances to be
> > > > > > >>> dispatched.
> > > > > > >>>
> > > > > > >>> @@ -1102,7 +1102,7 @@ static int cfq_dispatch_requests(struct
> > > > > > >>> request_queue *q, int force)
> > > > > > >>>                                 break;
> > > > > > >>>                 }
> > > > > > >>>
> > > > > > >>> -               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > > >>> +               if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq) && !
> > > > > > >>> cfqd->hw_tag)
> > > > > > >>>                         break;
> > > > > > >>>
> > > > > > >>> 3. Force the write request issued by journal_write_commit_record
> > > > > > >>> to be a sync request. As a matter of fact, most write requests
> > > > > > >>> submitted by kjournald look like async requests; we need to
> > > > > > >>> convert them to sync requests.
> > > > > > >>
> > > > > > >>Thanks for the very detailed analysis of the problem, complete with
> > > > > > >>suggestions. While I think that any code that does:
> > > > > > >>
> > > > > > >>        submit async io
> > > > > > >>        wait for it
> > > > > > >>
> > > > > > >>should be issuing sync IO (or, better, automatically upgrading the
> > > > > > >>request from async -> sync), we cannot rely on that.
> > > > > > [YM] We can discuss this case by case. We could at least convert some
> > > > > > important async IO code paths to sync IO. For example, kjournald
> > > > > > calls sync_dirty_buffer, which is what we captured in this case.
> > > > > 
> > > > > I agree, we should fix the obvious cases. My point was merely that there
> > > > > will probably always be missed cases, so we should attempt to handle it
> > > > > in the scheduler as well. Does the below buffer patch make it any
> > > > > better?
> > > > 
> > > > Yes, kjournald blocked issue is gone with below patch applied.
> > > 
> > > I think it's obviously the right thing to do, but I'm also a bit worried
> > > about applying it so close to 2.6.26 release. OTOH, we need to do
> > > SOMETHING for 2.6.26 release, so...
> > > 
> > > > 
> > > > Lin Ming
> > > > 
> > > > > 
> > > > > > Another case is writeback. Processes doing mmapped I/O might stop in
> > > > > > a page fault to wait for writeback to finish, or a buffered write
> > > > > > might trigger dirty page balancing. As the latest kernel is more
> > > > > > aggressive about starting writeback, it might be an issue now.
> > > > > 
> > > > > Sync process getting stuck in async writeout is another problem of the
> > > > > same variety.
> > > > > 
> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > > > index a073f3f..1957a8f 100644
> > > > > --- a/fs/buffer.c
> > > > > +++ b/fs/buffer.c
> > > > > @@ -2978,7 +2978,7 @@ int sync_dirty_buffer(struct buffer_head *bh)
> > > > >  	if (test_clear_buffer_dirty(bh)) {
> > > > >  		get_bh(bh);
> > > > >  		bh->b_end_io = end_buffer_write_sync;
> > > > > -		ret = submit_bh(WRITE, bh);
> > > > > +		ret = submit_bh(WRITE_SYNC, bh);
> > > > >  		wait_on_buffer(bh);
> > > > >  		if (buffer_eopnotsupp(bh)) {
> > > > >  			clear_buffer_eopnotsupp(bh);
> > > > > 
> > > > 
> > > 
> > 
> 



end of thread, other threads:[~2008-06-27  8:12 UTC | newest]

Thread overview: 9+ messages
2008-06-16  2:04 FIO: kjournald blocked for more than 120 seconds Lin Ming
2008-06-16 19:29 ` Jens Axboe
2008-06-17  1:40   ` Zhang, Yanmin
2008-06-17  8:36     ` Jens Axboe
2008-06-17  9:02       ` Lin Ming
2008-06-17  9:31         ` Jens Axboe
2008-06-26  9:33           ` Lin Ming
2008-06-26  9:49             ` Jens Axboe
2008-06-27  8:09               ` Lin Ming
