* [PATCH] core: Actually EIO is a fatal error
@ 2012-09-21 11:04 Dmitry Monakhov
2012-09-21 11:25 ` Jens Axboe
0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Monakhov @ 2012-09-21 11:04 UTC (permalink / raw)
To: fio; +Cc: axboe, Dmitry Monakhov
As far as I understand, this is just a typo.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
fio.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fio.h b/fio.h
index b2bbe93..f6f9792 100644
--- a/fio.h
+++ b/fio.h
@@ -559,7 +559,7 @@ static inline void fio_ro_check(struct thread_data *td, struct io_u *io_u)
#define REAL_MAX_JOBS 2048
-#define td_non_fatal_error(e) ((e) == EIO || (e) == EILSEQ)
+#define td_non_fatal_error(e) (!((e) == EIO || (e) == EILSEQ))
static inline enum error_type td_error_type(enum fio_ddir ddir, int err)
{
--
1.7.7.6
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 11:04 [PATCH] core: Actually EIO is a fatal error Dmitry Monakhov
@ 2012-09-21 11:25 ` Jens Axboe
2012-09-21 11:42 ` Dmitry Monakhov
0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2012-09-21 11:25 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: fio
On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
> As soon as i understand this is just a mistype.
It's not a typo. By that logic, EILSEQ is fatal too, since it is a
verification failure of read data (so might as well have been an EIO).
Non-fatal, in this context, means errors that fio can recover from and
continue doing work.
--
Jens Axboe
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 11:25 ` Jens Axboe
@ 2012-09-21 11:42 ` Dmitry Monakhov
2012-09-21 12:00 ` Jens Axboe
0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Monakhov @ 2012-09-21 11:42 UTC (permalink / raw)
To: Jens Axboe; +Cc: fio
On Fri, 21 Sep 2012 13:25:37 +0200, Jens Axboe <axboe@kernel.dk> wrote:
> On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
> > As soon as i understand this is just a mistype.
>
> It's not a typo. By that logic, EILSEQ is fatal too, since it is a
> verification failure of read data (so might as well have been an EIO).
> Fatal, in this context, means errors that fio can recover from and
> continue doing work.
Oh, I meant to say that both errors are fatal, but the function is called
td_NON_fatal_error, and it returns true in the case of EIO or EILSEQ.
This breaks the continue_on_error logic, because of this check at
io_u.c 1440:
if (icd->error && td_non_fatal_error(icd->error) &&
    (td->o.continue_on_error & td_error_type(io_u->ddir, icd->error))) {
	/*
	 * If there is a non_fatal error, then add to the error count
	 * and clear all the errors.
	 */
	update_error_count(td, icd->error);
	td_clear_error(td);
	icd->error = 0;
	io_u->error = 0;
}
That's why I inverted the result.
FYI, right after I changed this, my test, which continuously hits ENOSPC,
went forward and provoked a panic :)
WARNING: at lib/list_debug.c:62 __list_del_entry+0x1ee/0x250()
Hardware name:
list_del corruption. next->prev should be ffff88022d5c1a30, but was
ffff880231f3e558
Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode
sg xhci_hcd ext3 jbd mbcache sd_mod crc_t10dif aesni_intel ablk_helper
cryptd aes_x86_64 aes_generic ahci libahci pata_acpi ata_generic
dm_mirror dm_region_hash dm_log dm_mod
Pid: 241, comm: kworker/u:3 Not tainted 3.6.0-rc1+ #62
Call Trace:
[<ffffffff81074523>] warn_slowpath_common+0xc3/0xf0
[<ffffffff81074606>] warn_slowpath_fmt+0x46/0x50
[<ffffffff8135eace>] __list_del_entry+0x1ee/0x250
[<ffffffff8109d4de>] move_linked_works+0x4e/0xd0
[<ffffffff810a0070>] cwq_activate_first_delayed+0xf0/0x120
[<ffffffff810a0819>] ? process_one_work+0x619/0x770
[<ffffffff810a0147>] cwq_dec_nr_in_flight+0xa7/0x160
[<ffffffff810a0819>] ? process_one_work+0x619/0x770
[<ffffffff810a08c9>] process_one_work+0x6c9/0x770
[<ffffffff810a0541>] ? process_one_work+0x341/0x770
[<ffffffffa03d0850>] ? put_io_page+0x60/0x60 [ext4]
[<ffffffff810a171c>] worker_thread+0x1cc/0x330
[<ffffffff810a1550>] ? manage_workers+0x140/0x140
[<ffffffff810a9d39>] kthread+0xc9/0xe0
[<ffffffff8175f6c4>] kernel_thread_helper+0x4/0x10
[<ffffffff81752f70>] ? retint_restore_args+0x13/0x13
[<ffffffff810a9c70>] ? __init_kthread_worker+0x70/0x70
[<ffffffff8175f6c0>] ? gs_change+0x13/0x13
---[ end trace abc6d2e3c8581c4a ]---
------------[ cut here ]------------
WARNING: at lib/list_debug.c:33 __list_add+0xdc/0x180()
Hardware name:
list_add corruption. prev->next should be next (ffff880229a1e260), but
was ffff880231f3e558. (prev=ffff880231f3e558).
Modules linked in: ext4 jbd2 cpufreq_ondemand acpi_cpufreq freq_table
mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode
sg xhci_hcd ext3 jbd mbcache sd_mod crc_t10dif aesni_intel ablk_helper
cryptd aes_x86_64 aes_generic ahci libahci pata_acpi ata_generic
dm_mirror dm_region_hash dm_log dm_mod
Pid: 0, comm: swapper/3 Tainted: G W 3.6.0-rc1+ #62
Call Trace:
<IRQ> [<ffffffff81074523>] warn_slowpath_common+0xc3/0xf0
[<ffffffff81074606>] warn_slowpath_fmt+0x46/0x50
[<ffffffff8135de3e>] ? __spin_lock_debug+0xae/0x110
[<ffffffff8135ec4c>] __list_add+0xdc/0x180
[<ffffffff8109fa10>] insert_work+0x80/0xd0
[<ffffffff810a2536>] __queue_work+0x4d6/0x5a0
[<ffffffffa03d0a04>] ? ext4_add_complete_io+0x54/0xc0 [ext4]
[<ffffffff810a2752>] queue_work_on+0x32/0x40
[<ffffffff810a27b8>] queue_work+0x38/0x50
[<ffffffffa03d0a34>] ext4_add_complete_io+0x84/0xc0 [ext4]
[<ffffffff817527e5>] ? _raw_spin_unlock_irqrestore+0x65/0x90
[<ffffffffa03c6c1d>] ext4_end_io_dio+0xdd/0xf0 [ext4]
[<ffffffff81261e95>] dio_complete+0x125/0x1a0
[<ffffffff81261fba>] dio_bio_end_aio+0xaa/0x100
[<ffffffff81185da7>] ? mempool_free_slab+0x17/0x20
[<ffffffff8125aba6>] bio_endio+0x76/0x80
[<ffffffffa0002bd9>] dec_pending+0x279/0x340 [dm_mod]
[<ffffffffa000360f>] clone_endio+0x12f/0x150 [dm_mod]
[<ffffffff8125aba6>] bio_endio+0x76/0x80
[<ffffffff812fe0cc>] req_bio_endio+0x15c/0x180
[<ffffffff81301fa6>] blk_update_request+0x216/0x630
[<ffffffff813023f5>] blk_update_bidi_request+0x35/0xf0
[<ffffffff813024dc>] blk_end_bidi_request+0x2c/0x90
[<ffffffff81302610>] blk_end_request+0x10/0x20
[<ffffffff8148cc80>] scsi_end_request+0x40/0xf0
[<ffffffff8148d0cc>] scsi_io_completion+0x32c/0x850
[<ffffffff8147f32b>] scsi_finish_command+0x1bb/0x1e0
[<ffffffff8148cb48>] scsi_softirq_done+0x158/0x1d0
[<ffffffff8130d5ac>] blk_done_softirq+0x8c/0xa0
[<ffffffff81080dfa>] __do_softirq+0x1ba/0x3e0
[<ffffffff8175283b>] ? _raw_spin_unlock+0x2b/0x50
[<ffffffff8175f7bc>] call_softirq+0x1c/0x30
[<ffffffff810206c4>] do_softirq+0x94/0x1d0
[<ffffffff8108136a>] irq_exit+0x7a/0x140
[<ffffffff817600c5>] do_IRQ+0xd5/0x100
[<ffffffff81752eaf>] common_interrupt+0x6f/0x6f
<EOI> [<ffffffff813a3bfc>] ? intel_idle+0x19c/0x1f0
[<ffffffff813a3bf8>] ? intel_idle+0x198/0x1f0
[<ffffffff815c75a9>] cpuidle_enter+0x19/0x20
[<ffffffff815c7c47>] cpuidle_enter_state+0x17/0x60
[<ffffffff815c7f3f>] cpuidle_idle_call+0x2af/0x4e0
[<ffffffff8113f97a>] ? rcu_idle_enter+0x19a/0x1d0
[<ffffffff8102b0ef>] cpu_idle+0xff/0x190
[<ffffffff8102affd>] ? cpu_idle+0xd/0x190
[<ffffffff81724beb>] start_secondary+0xcd/0xcf
---[ end trace abc6d2e3c8581c4b ]---
>
>
> --
> Jens Axboe
>
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 11:42 ` Dmitry Monakhov
@ 2012-09-21 12:00 ` Jens Axboe
2012-09-21 12:13 ` Dmitry Monakhov
0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2012-09-21 12:00 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: fio
On 09/21/2012 01:42 PM, Dmitry Monakhov wrote:
> On Fri, 21 Sep 2012 13:25:37 +0200, Jens Axboe <axboe@kernel.dk> wrote:
>> On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
>>> As soon as i understand this is just a mistype.
>>
>> It's not a typo. By that logic, EILSEQ is fatal too, since it is a
>> verification failure of read data (so might as well have been an EIO).
>> Fatal, in this context, means errors that fio can recover from and
>> continue doing work.
> Ohh i ment to say that both errors are fatal, but function called
And I'm saying that NEITHER of them are fatal.
> td_NON_fatal_error, and it result true in case of EIO or EILSEQ
> this result continue_on_error logic broken because
> io_u.c 1440:
> if (icd->error && td_non_fatal_error(icd->error) &&
> (td->o.continue_on_error & td_error_type(io_u->ddir,
> icd->error))) {
Right, so if there is an error and that error is non-fatal, we continue
past it when continue_on_error covers that error type. It is logged and
we continue on our business.
So I'm a little confused as to why you think the test is inverted...
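
The gating described here can be sketched roughly as follows. This is only an illustration under stated assumptions, not fio's code: `err_class`, `io_dir`, `struct job`, `classify()` and `handle_completion_error()` are hypothetical names, and the mapping of EILSEQ to a "verify" class is assumed from the description of EILSEQ earlier in the thread. Only the body of `non_fatal_error()` is taken from the unpatched macro.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical error classes, one bit each (not fio's real definitions). */
enum err_class {
	ERR_CLASS_READ   = 1 << 0,
	ERR_CLASS_WRITE  = 1 << 1,
	ERR_CLASS_VERIFY = 1 << 2,
};

/* Hypothetical I/O directions. */
enum io_dir { IO_READ, IO_WRITE };

/* Same expression as the unpatched td_non_fatal_error() macro. */
static bool non_fatal_error(int err)
{
	return err == EIO || err == EILSEQ;
}

/* Assumed mapping: EILSEQ is a verify failure, otherwise use the direction. */
static enum err_class classify(enum io_dir dir, int err)
{
	if (err == EILSEQ)
		return ERR_CLASS_VERIFY;
	return dir == IO_READ ? ERR_CLASS_READ : ERR_CLASS_WRITE;
}

/* Minimal job state: which classes may be skipped, and how many were. */
struct job {
	unsigned int continue_mask;
	unsigned int error_count;
};

/*
 * Returns true if the error was counted and cleared so the job can go on,
 * false if the caller has to treat it as ending the job.
 */
static bool handle_completion_error(struct job *j, enum io_dir dir, int *err)
{
	assert(j && err);

	if (*err && non_fatal_error(*err) &&
	    (j->continue_mask & classify(dir, *err))) {
		j->error_count++;	/* log it ... */
		*err = 0;		/* ... and carry on */
		return true;
	}
	return false;
}
```

With a mask of `ERR_CLASS_READ | ERR_CLASS_VERIFY`, a read that fails with EIO is counted and cleared, while a write that fails with EIO, or any I/O that fails with ENOSPC, is left in place for the caller to act on.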
> FYI right after i've changed this my test which continuously hit ENOSPC
> goes forward and provoke panic :)
> WARNING: at lib/list_debug.c:62 __list_del_entry+0x1ee/0x250()
Heh, always great to trigger kernel bugs with fio :-)
--
Jens Axboe
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 12:00 ` Jens Axboe
@ 2012-09-21 12:13 ` Dmitry Monakhov
2012-09-21 12:20 ` Jens Axboe
0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Monakhov @ 2012-09-21 12:13 UTC (permalink / raw)
To: Jens Axboe; +Cc: fio
On Fri, 21 Sep 2012 14:00:18 +0200, Jens Axboe <axboe@kernel.dk> wrote:
> On 09/21/2012 01:42 PM, Dmitry Monakhov wrote:
> > On Fri, 21 Sep 2012 13:25:37 +0200, Jens Axboe <axboe@kernel.dk> wrote:
> >> On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
> >>> As soon as i understand this is just a mistype.
> >>
> >> It's not a typo. By that logic, EILSEQ is fatal too, since it is a
> >> verification failure of read data (so might as well have been an EIO).
> >> Fatal, in this context, means errors that fio can recover from and
> >> continue doing work.
> > Ohh i ment to say that both errors are fatal, but function called
>
> And I'm saying that NEITHER of them are fatal.
>
> > td_NON_fatal_error, and it result true in case of EIO or EILSEQ
> > this result continue_on_error logic broken because
> > io_u.c 1440:
> > if (icd->error && td_non_fatal_error(icd->error) &&
> > (td->o.continue_on_error & td_error_type(io_u->ddir,
> > icd->error))) {
>
> Right, so if error and error is non-fatal, we continue on that error
> unless told otherwise. It is logged and we continue on our business.
Please don't get me wrong, but please take a more careful look.
Original code: ((e) == EIO || (e) == EILSEQ)
This is true for fatal errors and false for non-fatal ones.
But the function is called td_NON_fatal_error(),
so it should return the opposite result.
So my code: (!((e) == EIO || (e) == EILSEQ)) is equivalent to
(err != EIO) && (err != EILSEQ)
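
The claimed equivalence is De Morgan's law, and it can be checked mechanically. The sketch below uses hypothetical helper names (`orig_form`, `patched_form`, `expanded_form`, `count_mismatches`); only the three boolean expressions come from this thread. It confirms that the patched and expanded forms agree, and that both are the exact complement of the original. It says nothing about which errors ought to count as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Expression from the unpatched macro. */
static bool orig_form(int e)
{
	return e == EIO || e == EILSEQ;
}

/* Expression from the patch. */
static bool patched_form(int e)
{
	return !(e == EIO || e == EILSEQ);
}

/* De Morgan expansion of the patched expression. */
static bool expanded_form(int e)
{
	return e != EIO && e != EILSEQ;
}

/*
 * Walk errno values 0..max_errno and count every value where the patched
 * and expanded forms disagree, or where the patched form fails to be the
 * complement of the original form.
 */
static int count_mismatches(int max_errno)
{
	int e, bad = 0;

	assert(max_errno >= 0);

	for (e = 0; e <= max_errno; e++) {
		if (patched_form(e) != expanded_form(e))
			bad++;
		if (patched_form(e) == orig_form(e))
			bad++;
	}
	return bad;
}
```

Calling `count_mismatches(4095)` returns 0: the two spellings of the patched macro are interchangeable, and the patch flips the result for every input.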
>
> So I'm a little confused as to why you think the test is reverted...
>
> > FYI right after i've changed this my test which continuously hit ENOSPC
> > goes forward and provoke panic :)
> > WARNING: at lib/list_debug.c:62 __list_del_entry+0x1ee/0x250()
>
> Heh, always great to trigger kernel bugs with fio :-)
>
> --
> Jens Axboe
>
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 12:13 ` Dmitry Monakhov
@ 2012-09-21 12:20 ` Jens Axboe
2012-09-21 12:56 ` Dmitry Monakhov
0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2012-09-21 12:20 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: fio
On 09/21/2012 02:13 PM, Dmitry Monakhov wrote:
> On Fri, 21 Sep 2012 14:00:18 +0200, Jens Axboe <axboe@kernel.dk> wrote:
>> On 09/21/2012 01:42 PM, Dmitry Monakhov wrote:
>>> On Fri, 21 Sep 2012 13:25:37 +0200, Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
>>>>> As soon as i understand this is just a mistype.
>>>>
>>>> It's not a typo. By that logic, EILSEQ is fatal too, since it is a
>>>> verification failure of read data (so might as well have been an EIO).
>>>> Fatal, in this context, means errors that fio can recover from and
>>>> continue doing work.
>>> Ohh i ment to say that both errors are fatal, but function called
>>
>> And I'm saying that NEITHER of them are fatal.
>>
>>> td_NON_fatal_error, and it result true in case of EIO or EILSEQ
>>> this result continue_on_error logic broken because
>>> io_u.c 1440:
>>> if (icd->error && td_non_fatal_error(icd->error) &&
>>> (td->o.continue_on_error & td_error_type(io_u->ddir,
>>> icd->error))) {
>>
>> Right, so if error and error is non-fatal, we continue on that error
>> unless told otherwise. It is logged and we continue on our business.
> Please dint get me wrong .... but please take a look more carefully
>
> Original code: ((e) == EIO || (e) == EILSEQ)
> True for fatal errors, and false for non fatal ones
> But function called td_NON_fatal_error()
> And it should result opposite result
>
> so my code: (!((e) == EIO || (e) == EILSEQ)) is equivalent of
> (err != EIO) && (err != EILSEQ)
You keep not reading my point. EIO and EILSEQ are not fatal errors!!
These are "expected" in the sense that we know what conditions trigger
them.
Also see the HOWTO, continue_on_error option.
--
Jens Axboe
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 12:20 ` Jens Axboe
@ 2012-09-21 12:56 ` Dmitry Monakhov
2012-09-21 13:08 ` Jens Axboe
0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Monakhov @ 2012-09-21 12:56 UTC (permalink / raw)
To: Jens Axboe; +Cc: fio
On Fri, 21 Sep 2012 14:20:12 +0200, Jens Axboe <axboe@kernel.dk> wrote:
> On 09/21/2012 02:13 PM, Dmitry Monakhov wrote:
> > On Fri, 21 Sep 2012 14:00:18 +0200, Jens Axboe <axboe@kernel.dk> wrote:
> >> On 09/21/2012 01:42 PM, Dmitry Monakhov wrote:
> >>> On Fri, 21 Sep 2012 13:25:37 +0200, Jens Axboe <axboe@kernel.dk> wrote:
> >>>> On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
> >>>>> As soon as i understand this is just a mistype.
> >>>>
> >>>> It's not a typo. By that logic, EILSEQ is fatal too, since it is a
> >>>> verification failure of read data (so might as well have been an EIO).
> >>>> Fatal, in this context, means errors that fio can recover from and
> >>>> continue doing work.
> >>> Ohh i ment to say that both errors are fatal, but function called
> >>
> >> And I'm saying that NEITHER of them are fatal.
> >>
> >>> td_NON_fatal_error, and it result true in case of EIO or EILSEQ
> >>> this result continue_on_error logic broken because
> >>> io_u.c 1440:
> >>> if (icd->error && td_non_fatal_error(icd->error) &&
> >>> (td->o.continue_on_error & td_error_type(io_u->ddir,
> >>> icd->error))) {
> >>
> >> Right, so if error and error is non-fatal, we continue on that error
> >> unless told otherwise. It is logged and we continue on our business.
> > Please dint get me wrong .... but please take a look more carefully
> >
> > Original code: ((e) == EIO || (e) == EILSEQ)
> > True for fatal errors, and false for non fatal ones
> > But function called td_NON_fatal_error()
> > And it should result opposite result
> >
> > so my code: (!((e) == EIO || (e) == EILSEQ)) is equivalent of
> > (err != EIO) && (err != EILSEQ)
>
> You keep not reading my point. EIO and EILSEQ are are not fatal errors!!
> These are "expected" in the sense that we know what conditions trigger
> them.
OK, I finally get the point. But I disagree with the terms,
because most filesystems and applications interpret EIO as a fatal
error. Once a device returns EIO to the filesystem, it will fall back to
RO mode or just panic. I have heard about some RAID-oriented HDDs which
tend to return EIO ASAP so that the RAID controller can remap the bio to
another drive, but this is a very special case, and such devices work
only with a RAID controller.
From my point of view, non-fatal errors are: ENOSPC, EBUSY, EAGAIN, ENOMEM.
Nonetheless, it would be reasonable to make the fatal error list
configurable. I'll prepare a patch shortly.
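
One possible shape for such a configurable list, purely as a sketch: the names `struct error_list`, `error_list_parse()` and `error_list_contains()`, the colon-separated numeric format and the cap of 16 entries are all assumptions made for illustration. None of this is an existing fio option or the patch mentioned above. The list here holds the errno values a job may continue past; anything not listed would be treated as fatal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_LISTED_ERRORS 16	/* arbitrary cap for this sketch */

/* Hypothetical per-job list of errno values the job may continue past. */
struct error_list {
	int codes[MAX_LISTED_ERRORS];
	size_t nr;
};

/*
 * Parse a colon-separated list of numeric errno values, e.g. "5:28".
 * An empty string yields an empty list. Returns 0 on success, -1 on
 * malformed input, out-of-range values, or too many entries.
 */
static int error_list_parse(struct error_list *l, const char *spec)
{
	assert(l && spec);

	l->nr = 0;
	while (*spec) {
		char *end;
		long v;

		errno = 0;
		v = strtol(spec, &end, 10);
		if (end == spec || errno || v <= 0 || v > 4095)
			return -1;
		if (l->nr == MAX_LISTED_ERRORS)
			return -1;
		l->codes[l->nr++] = (int)v;

		if (*end == ':')
			end++;		/* skip the separator */
		else if (*end)
			return -1;	/* junk after the number */
		spec = end;
	}
	return 0;
}

/* True if err is listed, i.e. the job is allowed to continue past it. */
static bool error_list_contains(const struct error_list *l, int err)
{
	size_t i;

	assert(l);

	for (i = 0; i < l->nr; i++)
		if (l->codes[i] == err)
			return true;
	return false;
}
```

On Linux, "5:28" would list EIO and ENOSPC. Using numeric values keeps the parser trivial, at the cost of being platform-specific; a real option would more likely accept symbolic names as well.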
>
> Also see the HOWTO, continue_on_error option.
>
> --
> Jens Axboe
>
* Re: [PATCH] core: Actually EIO is a fatal error
2012-09-21 12:56 ` Dmitry Monakhov
@ 2012-09-21 13:08 ` Jens Axboe
0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2012-09-21 13:08 UTC (permalink / raw)
To: Dmitry Monakhov; +Cc: fio
On 09/21/2012 02:56 PM, Dmitry Monakhov wrote:
> On Fri, 21 Sep 2012 14:20:12 +0200, Jens Axboe <axboe@kernel.dk> wrote:
>> On 09/21/2012 02:13 PM, Dmitry Monakhov wrote:
>>> On Fri, 21 Sep 2012 14:00:18 +0200, Jens Axboe <axboe@kernel.dk> wrote:
>>>> On 09/21/2012 01:42 PM, Dmitry Monakhov wrote:
>>>>> On Fri, 21 Sep 2012 13:25:37 +0200, Jens Axboe <axboe@kernel.dk> wrote:
>>>>>> On 09/21/2012 01:04 PM, Dmitry Monakhov wrote:
>>>>>>> As soon as i understand this is just a mistype.
>>>>>>
>>>>>> It's not a typo. By that logic, EILSEQ is fatal too, since it is a
>>>>>> verification failure of read data (so might as well have been an EIO).
>>>>>> Fatal, in this context, means errors that fio can recover from and
>>>>>> continue doing work.
>>>>> Ohh i ment to say that both errors are fatal, but function called
>>>>
>>>> And I'm saying that NEITHER of them are fatal.
>>>>
>>>>> td_NON_fatal_error, and it result true in case of EIO or EILSEQ
>>>>> this result continue_on_error logic broken because
>>>>> io_u.c 1440:
>>>>> if (icd->error && td_non_fatal_error(icd->error) &&
>>>>> (td->o.continue_on_error & td_error_type(io_u->ddir,
>>>>> icd->error))) {
>>>>
>>>> Right, so if error and error is non-fatal, we continue on that error
>>>> unless told otherwise. It is logged and we continue on our business.
>>> Please dint get me wrong .... but please take a look more carefully
>>>
>>> Original code: ((e) == EIO || (e) == EILSEQ)
>>> True for fatal errors, and false for non fatal ones
>>> But function called td_NON_fatal_error()
>>> And it should result opposite result
>>>
>>> so my code: (!((e) == EIO || (e) == EILSEQ)) is equivalent of
>>> (err != EIO) && (err != EILSEQ)
>>
>> You keep not reading my point. EIO and EILSEQ are are not fatal errors!!
>> These are "expected" in the sense that we know what conditions trigger
>> them.
> Ok i've finally get the point. But i'm disagree with terms
> beacuse most filesystems and applications interpret EIO as fatal
> error. Once device return EIO to filesystem it will fall back to RO mode
> or just panic. I heard about some RAID oriented HDD which tend to return
> EIO ASAP so raid controller may remap bio to another drive, but this is
> very special case and such devices works only with raid controller.
> From my point of view non fatal error are: ENOSPC, EBUSY, EAGAIN, ENOMEM
Depends on your point of view. If it's a write workload, ENOSPC probably
means "we are done, don't bother writing again". The fatal here is just
whether fio can continue safely or not. Running a job past various EIO
or verify failures is a very valid use case, instead of just terminating
on the first EIO seen.
> Nor than less it would be reasonable to make fatal error list
> configurable. I'll prepare a patch sortly.
That'd be fine indeed.
--
Jens Axboe