* Negative "ios_in_flight" in the 2.4 kernel
@ 2004-12-22 5:05 M. Edward Borasky
2004-12-22 11:16 ` Jens Axboe
0 siblings, 1 reply; 6+ messages in thread
From: M. Edward Borasky @ 2004-12-22 5:05 UTC (permalink / raw)
To: Linux Kernel Mailing List
I've been looking at some "iostat" data from a 2.4.26 machine. The
device utilizations are 100 percent, even when the disk is idle, which
is mathematically impossible. By doing some digging, I discovered this
is a kernel bug, caused by "hd->ios_in_flight" going negative. The
relevant code appears to my untrained eyes to be in
drivers/block/ll_rw_blk.c, specifically
static inline void down_ios(struct hd_struct *hd)
{
	disk_round_stats(hd);
	--hd->ios_in_flight;
}

static inline void up_ios(struct hd_struct *hd)
{
	disk_round_stats(hd);
	++hd->ios_in_flight;
}
Question: wouldn't a simple refusal to decrement ios_in_flight in
"down_ios" if it's zero fix this, or am I missing something?
Ed Borasky
znmeb@cesmail.net
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: Negative "ios_in_flight" in the 2.4 kernel
2004-12-22 5:05 Negative "ios_in_flight" in the 2.4 kernel M. Edward Borasky
@ 2004-12-22 11:16 ` Jens Axboe
2004-12-22 15:19 ` M. Edward Borasky
From: Jens Axboe @ 2004-12-22 11:16 UTC (permalink / raw)
To: M. Edward Borasky; +Cc: Linux Kernel Mailing List
On Tue, Dec 21 2004, M. Edward Borasky wrote:
> I've been looking at some "iostat" data from a 2.4.26 machine. The
> device utilizations are 100 percent, even when the disk is idle, which
> is mathematically impossible. By doing some digging, I discovered this
> is a kernel bug, caused by "hd->ios_in_flight" going negative. The
> relevant code appears to my untrained eyes to be in
> drivers/block/ll_rw_blk.c, specifically
>
>
> static inline void down_ios(struct hd_struct *hd)
> {
> 	disk_round_stats(hd);
> 	--hd->ios_in_flight;
> }
>
> static inline void up_ios(struct hd_struct *hd)
> {
> 	disk_round_stats(hd);
> 	++hd->ios_in_flight;
> }
>
> Question: wouldn't a simple refusal to decrement ios_in_flight in
> "down_ios" if it's zero fix this, or am I missing something?
That would paper over the real bug, but it will work for you.
--
Jens Axboe
* Re: Negative "ios_in_flight" in the 2.4 kernel
2004-12-22 11:16 ` Jens Axboe
@ 2004-12-22 15:19 ` M. Edward Borasky
2004-12-22 15:58 ` Marcelo Tosatti
From: M. Edward Borasky @ 2004-12-22 15:19 UTC (permalink / raw)
To: Linux Kernel Mailing List
On Wed, 2004-12-22 at 12:16 +0100, Jens Axboe wrote:
>
> > Question: wouldn't a simple refusal to decrement ios_in_flight in
> > "down_ios" if it's zero fix this, or am I missing something?
>
> That would paper over the real bug, but it will work for you.
What is the "real bug", then? What will "work for me" is accurate disk
usage tick counts. The intent of these statistics is something known as
Operational Analysis of Queueing Networks.
The "requirement" is that the operations on each device be accurately
counted, and the "wall clock" time spent *waiting* for requests and the
time spent *servicing* requests be accurately accumulated for each
device. The sector count is a bonus.
From these raw counters, one can, and iostat does, compute throughput,
utilization, average service time, average wait time and average queue
length. An excellent and highly readable reference for the math involved
can be found at
http://www.cs.washington.edu/homes/lazowska/qsp/Images/Chap_03.pdf
That is the intent behind these counters, and what will "work for me" is
a kernel that captures the raw counters correctly. If forcing
ios_in_flight to be non-negative is done at the expense of losing or
gaining ticks in the wait or service time accumulators, then it will not
work for me.
Ed Borasky
http://www.borasky-research.net
* Re: Negative "ios_in_flight" in the 2.4 kernel
2004-12-22 15:19 ` M. Edward Borasky
@ 2004-12-22 15:58 ` Marcelo Tosatti
2004-12-23 8:08 ` Jens Axboe
From: Marcelo Tosatti @ 2004-12-22 15:58 UTC (permalink / raw)
To: M. Edward Borasky; +Cc: Linux Kernel Mailing List, Jens Axboe
On Wed, Dec 22, 2004 at 07:19:42AM -0800, M. Edward Borasky wrote:
> On Wed, 2004-12-22 at 12:16 +0100, Jens Axboe wrote:
> >
> > > Question: wouldn't a simple refusal to decrement ios_in_flight in
> > > "down_ios" if it's zero fix this, or am I missing something?
> >
> > That would paper over the real bug, but it will work for you.
> What is the "real bug", then? What will "work for me" is accurate disk
> usage tick counts. The intent of these statistics is something known as
> Operational Analysis of Queueing Networks.
>
> The "requirement" is that the operations on each device be accurately
> counted, and the "wall clock" time spent *waiting* for requests and the
> time spent *servicing* requests be accurately accumulated for each
> device. The sector count is a bonus.
>
> From these raw counters, one can, and iostat does, compute throughput,
> utilization, average service time, average wait time and average queue
> length. An excellent and highly readable reference for the math involved
> can be found at
>
> http://www.cs.washington.edu/homes/lazowska/qsp/Images/Chap_03.pdf
>
> That is the intent behind these counters, and what will "work for me" is
> a kernel that captures the raw counters correctly. If forcing
> ios_in_flight to be non-negative is done at the expense of losing or
> gaining ticks in the wait or service time accumulators, then it will not
> work for me.
Well, something is decrementing incorrectly (doh); probably the disk/partition
accounting logic goes wrong under some condition, Jens?
void req_merged_io(struct request *req)
{
	struct hd_struct *hd1, *hd2;

	locate_hd_struct(req, &hd1, &hd2);
	if (hd1)
		down_ios(hd1);
	if (hd2)
		down_ios(hd2);
}

void req_finished_io(struct request *req)
{
	struct hd_struct *hd1, *hd2;

	locate_hd_struct(req, &hd1, &hd2);
	if (hd1)
		account_io_end(hd1, req);
	if (hd2)
		account_io_end(hd2, req);
}
We could eliminate that possibility if you ran your tests with a single
non-partitioned disk, but that's just a guess.
Jens certainly has more of a clue here than I do.
* Re: Negative "ios_in_flight" in the 2.4 kernel
2004-12-22 15:58 ` Marcelo Tosatti
@ 2004-12-23 8:08 ` Jens Axboe
2004-12-23 15:30 ` M. Edward Borasky
From: Jens Axboe @ 2004-12-23 8:08 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: M. Edward Borasky, Linux Kernel Mailing List
On Wed, Dec 22 2004, Marcelo Tosatti wrote:
> On Wed, Dec 22, 2004 at 07:19:42AM -0800, M. Edward Borasky wrote:
> > On Wed, 2004-12-22 at 12:16 +0100, Jens Axboe wrote:
> > >
> > > > Question: wouldn't a simple refusal to decrement ios_in_flight in
> > > > "down_ios" if it's zero fix this, or am I missing something?
> > >
> > > That would paper over the real bug, but it will work for you.
> > What is the "real bug", then? What will "work for me" is accurate disk
> > usage tick counts. The intent of these statistics is something known as
> > Operational Analysis of Queueing Networks.
> >
> > The "requirement" is that the operations on each device be accurately
> > counted, and the "wall clock" time spent *waiting* for requests and the
> > time spent *servicing* requests be accurately accumulated for each
> > device. The sector count is a bonus.
> >
> > From these raw counters, one can, and iostat does, compute throughput,
> > utilization, average service time, average wait time and average queue
> > length. An excellent and highly readable reference for the math involved
> > can be found at
> >
> > http://www.cs.washington.edu/homes/lazowska/qsp/Images/Chap_03.pdf
> >
> > That is the intent behind these counters, and what will "work for me" is
> > a kernel that captures the raw counters correctly. If forcing
> > ios_in_flight to be non-negative is done at the expense of losing or
> > gaining ticks in the wait or service time accumulators, then it will not
> > work for me.
>
> Well, something is decrementing incorrectly (doh); probably the disk/partition
> accounting logic goes wrong under some condition, Jens?
>
> void req_merged_io(struct request *req)
> {
> 	struct hd_struct *hd1, *hd2;
>
> 	locate_hd_struct(req, &hd1, &hd2);
> 	if (hd1)
> 		down_ios(hd1);
> 	if (hd2)
> 		down_ios(hd2);
> }
>
> void req_finished_io(struct request *req)
> {
> 	struct hd_struct *hd1, *hd2;
>
> 	locate_hd_struct(req, &hd1, &hd2);
> 	if (hd1)
> 		account_io_end(hd1, req);
> 	if (hd2)
> 		account_io_end(hd2, req);
> }
>
> We could eliminate that possibility if you ran your tests with a single
> non-partitioned disk, but that's just a guess.
It would be nice to know if this was a vanilla kernel or patched in some
way. The only recent bug in this area I remember was a bad merge in the
SUSE tree with the io_request_lock scaling patch.
(and don't trim the cc list when replying, at least not if you want
people to see your message)
--
Jens Axboe
* Re: Negative "ios_in_flight" in the 2.4 kernel
2004-12-23 8:08 ` Jens Axboe
@ 2004-12-23 15:30 ` M. Edward Borasky
From: M. Edward Borasky @ 2004-12-23 15:30 UTC (permalink / raw)
To: Jens Axboe; +Cc: Marcelo Tosatti, Linux Kernel Mailing List
On Thu, 2004-12-23 at 09:08 +0100, Jens Axboe wrote:
> > We could eliminate that possibility if you ran your tests with a single
> > non-partitioned disk, but thats just a guess.
>
> It would be nice to know if this was a vanilla kernel or patched in some
> way. The only recent bug in this area I remember was a bad merge in the
> SUSE tree with the io_request_lock scaling patch.
I have seen this with Red Hat 2.4.18 (from RH 8.0) kernels, Gentoo
2.4.25 and 2.4.26 kernels, on both single-disk and two-disk systems. Now
that I think of it, I've seen this on both single-processor and
multi-processor systems and with both SCSI and IDE drives. I have also
seen these systems run for quite a while without ios_in_flight going
negative. And I've never seen ios_in_flight lower than -1 or higher than
0 on an idle system. So my conclusion is that an extra downcount is
fairly rare.
I saw a very similar bug listed in the LKML about a year ago. For
example, see
http://search.luky.org/linux-kernel.2004/msg00025.html
and
http://search.luky.org/linux-kernel.2004/msg03278.html
I think I'll try rebooting the two-disk box (which is easier to get one
truly idle disk on) and running bonnie++ periodically to see if I can
get steady-state ios_in_flight values other than -1 and 0 on an idle
system (between bonnie++ runs). I can set up "oprofile" on the Gentoo
boxes if that will help.
One other note: all of these systems when "idle" have a small amount of
write activity going on. The Red Hat boxes are using ext3 filesystems
and the Gentoo systems are using reiserfs. Is constant low-level writing
to be expected with journaling?