Process Scheduling Issue using sg/libata

linux-ide.vger.kernel.org archive mirror

* Process Scheduling Issue using sg/libata
@ 2007-11-17  0:49 Fajun Chen
  2007-11-17  3:02 ` Tejun Heo
  2007-11-17  4:30 ` Mark Lord
  0 siblings, 2 replies; 19+ messages in thread
From: Fajun Chen @ 2007-11-17  0:49 UTC (permalink / raw)
  To: linux-ide@vger.kernel.org, linux-scsi

Hi All,

I use sg/libata and ATA pass-through for reads and writes. Linux
2.6.18-rc2 and libata version 2.00 are loaded on an ARM XScale board.
Under heavy CPU load (e.g. when blocks per transfer/sector count is set
to 1), I've observed that the test application can hog the CPU for a
long time (more than 20 seconds), and other processes, including a
high-priority shell, cannot get a time slice to run.  What's interesting
is that if the application is under heavy I/O load (e.g. when blocks
per transfer/sector count is set to 256), the problem goes away. I
also tested with the open-source sg_utils and got the same result, so
this problem is not specific to my user-space application.
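
As a rough illustration of why the two settings load the CPU so
differently, the sketch below counts how many commands are needed to
move the same amount of data at each sector count. It assumes 512-byte
sectors; `SECTOR_SIZE` and `commands_needed` are hypothetical names, and
the 0.5 MB total used in the note after it is only an example figure.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sector size in bytes (an assumption, not stated in the report). */
#define SECTOR_SIZE 512u

/*
 * Number of commands needed to move total_bytes when each command
 * transfers sectors_per_cmd sectors. A partial last command counts
 * as a full command, so the division rounds up.
 */
static size_t commands_needed(size_t total_bytes, unsigned sectors_per_cmd)
{
    assert(sectors_per_cmd > 0);
    size_t bytes_per_cmd = (size_t)sectors_per_cmd * SECTOR_SIZE;
    return (total_bytes + bytes_per_cmd - 1) / bytes_per_cmd;
}
```

Under these assumptions, `commands_needed(512 * 1024, 1)` is 1024 while
`commands_needed(512 * 1024, 256)` is 4. Each command carries its own
ioctl, command setup and completion handling, so at a sector count of 1
the per-command overhead plausibly dominates the CPU time.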

Since user preemption is checked when the kernel is about to return to
user-space from a system call, the process scheduler should be invoked
after each system call. Something seems to be broken here.  I found a
similar issue here:
http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
But that turned out to be an issue with the MTD/JFFS2 drivers, which
are not used in my system.

Has anyone experienced similar issues with sg/libata? Any information
would be greatly appreciated.

Thanks,
Fajun

* Re: Process Scheduling Issue using sg/libata
  2007-11-17  0:49 Process Scheduling Issue using sg/libata Fajun Chen
@ 2007-11-17  3:02 ` Tejun Heo
  2007-11-17  6:14   ` Fajun Chen
  2007-11-17  4:30 ` Mark Lord
  1 sibling, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2007-11-17  3:02 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi

Fajun Chen wrote:
> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> I've observed that the test application can suck cpu away for long
> time (more than 20 seconds) and other processes including high
> priority shell can not get the time slice to run.  What's interesting
> is that if the application is under heavy IO load (e.g. when blocks
> per transfer/sector count is set to 256),  the problem goes away. I
> also tested with open source code sg_utils and got the same result, so
> this is not a problem specific to my user-space application.
> 
> Since user preemption is checked when the kernel is about to return to
> user-space from a system call,  process scheduler should be invoked
> after each system call. Something seems to be broken here.  I found a
> similar issue below:
> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> But that turns out to be an issue with MTD/JFFS2 drivers, which are
> not used in my system.
> 
> Has anyone experienced similar issues with sg/libata? Any information
> would be greatly appreciated.

That's one weird story.  Does the kernel say anything during those 20 seconds?

-- 
tejun

* Re: Process Scheduling Issue using sg/libata
  2007-11-17  0:49 Process Scheduling Issue using sg/libata Fajun Chen
  2007-11-17  3:02 ` Tejun Heo
@ 2007-11-17  4:30 ` Mark Lord
  2007-11-17  7:20   ` Fajun Chen
  1 sibling, 1 reply; 19+ messages in thread
From: Mark Lord @ 2007-11-17  4:30 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Fajun Chen wrote:
> Hi All,
> 
> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> I've observed that the test application can suck cpu away for long
> time (more than 20 seconds) and other processes including high
> priority shell can not get the time slice to run.  What's interesting
> is that if the application is under heavy IO load (e.g. when blocks
> per transfer/sector count is set to 256),  the problem goes away. I
> also tested with open source code sg_utils and got the same result, so
> this is not a problem specific to my user-space application.
..

Post the relevant code here, and then we'll be able to better understand
and explain it to you.

For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
then this behaviour does not surprise me in the least.  Fully expected
and difficult to avoid.
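
The opcode list above can be captured in a small helper, sketched here
in C. The function name `is_rw_pio_opcode` is hypothetical; the opcode
values are exactly the ones listed, and the command names in the
comments are the usual ATA names for them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if op is one of the R/W PIO opcodes listed above. */
static bool is_rw_pio_opcode(uint8_t op)
{
    switch (op) {
    case 0x20: case 0x21: case 0x24:  /* READ SECTOR(S), incl. EXT */
    case 0x30: case 0x31: case 0x34:  /* WRITE SECTOR(S), incl. EXT */
    case 0x29: case 0x39:             /* READ/WRITE MULTIPLE EXT */
    case 0xc4: case 0xc5:             /* READ/WRITE MULTIPLE */
        return true;
    default:
        return false;
    }
}
```

A DMA opcode such as 0xc8 (READ DMA) is not in the list, so the helper
returns false for it.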

Cheers


* Re: Process Scheduling Issue using sg/libata
  2007-11-17  3:02 ` Tejun Heo
@ 2007-11-17  6:14   ` Fajun Chen
  2007-11-17 17:13     ` James Chapman
  0 siblings, 1 reply; 19+ messages in thread
From: Fajun Chen @ 2007-11-17  6:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide@vger.kernel.org, linux-scsi

On 11/16/07, Tejun Heo <htejun@gmail.com> wrote:
> Fajun Chen wrote:
> > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> > and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> > cpu load (e.g. when blocks per transfer/sector count is set to 1),
> > I've observed that the test application can suck cpu away for long
> > time (more than 20 seconds) and other processes including high
> > priority shell can not get the time slice to run.  What's interesting
> > is that if the application is under heavy IO load (e.g. when blocks
> > per transfer/sector count is set to 256),  the problem goes away. I
> > also tested with open source code sg_utils and got the same result, so
> > this is not a problem specific to my user-space application.
> >
> > Since user preemption is checked when the kernel is about to return to
> > user-space from a system call,  process scheduler should be invoked
> > after each system call. Something seems to be broken here.  I found a
> > similar issue below:
> > http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> > But that turns out to be an issue with MTD/JFFS2 drivers, which are
> > not used in my system.
> >
> > Has anyone experienced similar issues with sg/libata? Any information
> > would be greatly appreciated.
>
> That's one weird story.  Does kernel say anything during that 20 seconds?
>
No. Nothing in the kernel log.

Fajun

* Re: Process Scheduling Issue using sg/libata
  2007-11-17  4:30 ` Mark Lord
@ 2007-11-17  7:20   ` Fajun Chen
  2007-11-17 16:25     ` Mark Lord
  0 siblings, 1 reply; 19+ messages in thread
From: Fajun Chen @ 2007-11-17  7:20 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

On 11/16/07, Mark Lord <liml@rtr.ca> wrote:
> Fajun Chen wrote:
> > Hi All,
> >
> > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> > and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> > cpu load (e.g. when blocks per transfer/sector count is set to 1),
> > I've observed that the test application can suck cpu away for long
> > time (more than 20 seconds) and other processes including high
> > priority shell can not get the time slice to run.  What's interesting
> > is that if the application is under heavy IO load (e.g. when blocks
> > per transfer/sector count is set to 256),  the problem goes away. I
> > also tested with open source code sg_utils and got the same result, so
> > this is not a problem specific to my user-space application.
> ..
>
> Post the relevant code here, and then we'll be able to better understand
> and explain it to you.
>
> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
> then this behaviour does not surprise me in the least.  Fully expected
> and difficult to avoid.
>

This problem also happens with R/W DMA ops. Below are simplified code snippets:
    // Open one sg device for read
    if ((sg_fd = open(dev_name, O_RDWR)) < 0)
    {
        ...
    }
    read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED, sg_fd, 0);

    // Open the same sg device for write
    if ((sg_fd_wr = open(dev_name, O_RDWR)) < 0)
    {
        ...
    }
    write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                              MAP_SHARED, sg_fd_wr, 0);

    sg_io_hdr_t io_hdr;

    memset(&io_hdr, 0, sizeof(sg_io_hdr_t));

    io_hdr.interface_id = 'S';
    io_hdr.mx_sb_len    = sizeof(sense_buffer);
    io_hdr.sbp          = sense_buffer;
    io_hdr.dxfer_len    = dxfer_len;
    io_hdr.cmd_len      = cmd_len;
    io_hdr.cmdp         = cmdp;             // ATA pass through command block
    io_hdr.timeout      = cmd_tmo * 1000;   // In millisecs
    io_hdr.pack_id      = id;               // Read/write counter for now
    io_hdr.iovec_count  = 0;                // scatter gather elements, 0=not being used

    if (direction == 1)
    {
        io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
        io_hdr.flags |= SG_FLAG_MMAP_IO;
        status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
    }
    else
    {
        io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        io_hdr.flags |= SG_FLAG_MMAP_IO;
        status = ioctl(sg_fd, SG_IO, &io_hdr);
    }
    ...
Mmapped I/O is a moot point here, since this problem is also observed
when using direct I/O.

Thanks,
Fajun

* Re: Process Scheduling Issue using sg/libata
  2007-11-17  7:20   ` Fajun Chen
@ 2007-11-17 16:25     ` Mark Lord
  2007-11-17 19:20       ` Fajun Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Lord @ 2007-11-17 16:25 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Fajun Chen wrote:
> On 11/16/07, Mark Lord <liml@rtr.ca> wrote:
>> Fajun Chen wrote:
>>> Hi All,
>>>
>>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
>>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
>>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
>>> I've observed that the test application can suck cpu away for long
>>> time (more than 20 seconds) and other processes including high
>>> priority shell can not get the time slice to run.  What's interesting
>>> is that if the application is under heavy IO load (e.g. when blocks
>>> per transfer/sector count is set to 256),  the problem goes away. I
>>> also tested with open source code sg_utils and got the same result, so
>>> this is not a problem specific to my user-space application.
>> ..
>>
>> Post the relevant code here, and then we'll be able to better understand
>> and explain it to you.
>>
>> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
>> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
>> then this behaviour does not surprise me in the least.  Fully expected
>> and difficult to avoid.
>>
> 
> This problem also happens with R/W DMA ops. Below are simplified code snippets:
>     // Open one sg device for read
>       if ((sg_fd  = open(dev_name, O_RDWR))<0)
>       {
>           ...
>       }
>       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>                              MAP_SHARED, sg_fd, 0);
> 
>     // Open the same sg device for write
>       if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
>       {
>          ...
>       }
>       write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>                              MAP_SHARED, sg_fd_wr, 0);
..

Mmmm.. what is the purpose of those two mmap'd areas?
I think this is important and relevant here:  what are they used for?

As coded above, these are memory-mapped areas that (1) overlap,
and (2) will be demand-paged automatically to/from the disk
as they are accessed/modified.  This *will* conflict with any SG_IO
operations happening at the same time on the same device.

????



>       sg_io_hdr_t io_hdr;
> 
>       memset(&io_hdr, 0, sizeof(sg_io_hdr_t));
> 
>       io_hdr.interface_id = 'S';
>       io_hdr.mx_sb_len    = sizeof(sense_buffer);
>       io_hdr.sbp          = sense_buffer;
>       io_hdr.dxfer_len    = dxfer_len;
>       io_hdr.cmd_len      = cmd_len;
>       io_hdr.cmdp         = cmdp;    // ATA pass through command block
>       io_hdr.timeout      = cmd_tmo * 1000;       // In millisecs
>       io_hdr.pack_id = id;  // Read/write counter for now
>       io_hdr.iovec_count=0;   // scatter gather elements, 0=not being used
> 
>       if (direction == 1)
>       {
>         io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
>         io_hdr.flags |= SG_FLAG_MMAP_IO;
>         status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
>       }
>       else
>       {
>         io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
>         io_hdr.flags |= SG_FLAG_MMAP_IO;
>         status = ioctl(sg_fd, SG_IO, &io_hdr);
>       }
>       ...
> Mmaped IO is a moot point here since this problem is also observed
> when using direct IO.
> 
> Thanks,
> Fajun


* Re: Process Scheduling Issue using sg/libata
  2007-11-17  6:14   ` Fajun Chen
@ 2007-11-17 17:13     ` James Chapman
  2007-11-17 19:37       ` Fajun Chen
  0 siblings, 1 reply; 19+ messages in thread
From: James Chapman @ 2007-11-17 17:13 UTC (permalink / raw)
  To: Fajun Chen; +Cc: Tejun Heo, linux-ide@vger.kernel.org, linux-scsi

Fajun Chen wrote:
> On 11/16/07, Tejun Heo <htejun@gmail.com> wrote:
>> Fajun Chen wrote:
>>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
>>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
>>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
>>> I've observed that the test application can suck cpu away for long
>>> time (more than 20 seconds) and other processes including high
>>> priority shell can not get the time slice to run.  What's interesting
>>> is that if the application is under heavy IO load (e.g. when blocks
>>> per transfer/sector count is set to 256),  the problem goes away. I
>>> also tested with open source code sg_utils and got the same result, so
>>> this is not a problem specific to my user-space application.
>>>
>>> Since user preemption is checked when the kernel is about to return to
>>> user-space from a system call,  process scheduler should be invoked
>>> after each system call. Something seems to be broken here.  I found a
>>> similar issue below:
>>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
>>> But that turns out to be an issue with MTD/JFFS2 drivers, which are
>>> not used in my system.
>>>
>>> Has anyone experienced similar issues with sg/libata? Any information
>>> would be greatly appreciated.
>> That's one weird story.  Does kernel say anything during that 20 seconds?
>>
> No. Nothing in kernel log.
> 
> Fajun

Have you considered using oprofile to find out what the CPU is doing
during the 20 seconds?

Does the problem occur when you put it under load using another method?
What are the ATA and network drivers here? I've seen some awful
out-of-tree device drivers hog the CPU with busy-waits and other crap.
Oprofile results should show the culprit.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


* Re: Process Scheduling Issue using sg/libata
  2007-11-17 16:25     ` Mark Lord
@ 2007-11-17 19:20       ` Fajun Chen
  2007-11-17 19:55         ` Mark Lord
  0 siblings, 1 reply; 19+ messages in thread
From: Fajun Chen @ 2007-11-17 19:20 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

On 11/17/07, Mark Lord <liml@rtr.ca> wrote:
> Fajun Chen wrote:
> > On 11/16/07, Mark Lord <liml@rtr.ca> wrote:
> >> Fajun Chen wrote:
> >>> Hi All,
> >>>
> >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> >>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> >>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> >>> I've observed that the test application can suck cpu away for long
> >>> time (more than 20 seconds) and other processes including high
> >>> priority shell can not get the time slice to run.  What's interesting
> >>> is that if the application is under heavy IO load (e.g. when blocks
> >>> per transfer/sector count is set to 256),  the problem goes away. I
> >>> also tested with open source code sg_utils and got the same result, so
> >>> this is not a problem specific to my user-space application.
> >> ..
> >>
> >> Post the relevant code here, and then we'll be able to better understand
> >> and explain it to you.
> >>
> >> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24,
> >> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops),
> >> then this behaviour does not surprise me in the least.  Fully expected
> >> and difficult to avoid.
> >>
> >
> > This problem also happens with R/W DMA ops. Below are simplified code snippets:
> >     // Open one sg device for read
> >       if ((sg_fd  = open(dev_name, O_RDWR))<0)
> >       {
> >           ...
> >       }
> >       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >                              MAP_SHARED, sg_fd, 0);
> >
> >     // Open the same sg device for write
> >       if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
> >       {
> >          ...
> >       }
> >       write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >                              MAP_SHARED, sg_fd_wr, 0);
> ..
>
> Mmmm.. what is the purpose of those two mmap'd areas ?
> I think this is important and relevant here:  what are they used for?
>
> > As coded above, these are memory mapped areas that (1) overlap,
> and (2) will be demand paged automatically to/from the disk
> as they are accessed/modified.  This *will* conflict with any SG_IO
> operations happening at the same time on the same device.
>
> ????

The purpose of using two memory mapped areas is to meet our
requirement that certain data patterns for writing need to be kept
across commands. For instance, if one buffer is used for both reads
and writes, then this buffer will need to be re-populated with certain
write data after each read command, which would be very costly for
write-read mixed type of ops. This separate R/W buffer setting also
facilitates data comparison.

These buffers are not used at the same time (one is used only
after the command on the other has completed). My application is the
only program accessing the disk via sg/libata; the rest of the
programs run from a ramdisk. Also, each buffer is only about 0.5MB and
we have 64MB of RAM on the target board.
With this setup, these two buffers should be pretty much independent
of the block layer/file system, correct?

One thing is worth mentioning here. If the application is set to
low priority (nice 19) or sched_yield() is called after each R/W
command, then this issue disappears but performance suffers.
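
The sched_yield() workaround described above might look like the
following sketch. `run_with_yield`, `io_cmd_fn` and the stand-in command
`count_cmd` are hypothetical names; in the real application the callback
would issue one SG_IO command.

```c
#include <sched.h>

/* Hypothetical callback type standing in for one R/W command. */
typedef int (*io_cmd_fn)(void *ctx);

/* Stand-in command for illustration: bumps a counter and reports success. */
static int count_cmd(void *ctx)
{
    ++*(unsigned *)ctx;
    return 0;
}

/*
 * Issue count commands, yielding the CPU after each one.
 * Returns how many commands reported success (returned 0).
 */
static unsigned run_with_yield(io_cmd_fn issue, void *ctx, unsigned count)
{
    unsigned ok = 0;
    for (unsigned i = 0; i < count; i++) {
        if (issue(ctx) == 0)
            ok++;
        sched_yield();  /* give other runnable processes a turn */
    }
    return ok;
}
```

Yielding after every command trades throughput for responsiveness,
which matches the performance drop reported above.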

Some thoughts here. For a static process, the Linux scheduler could
assign a dynamic priority based on activity, age, etc. Is there any
chance that the scheduler favors my application unfairly due to the
load condition?

Thanks,
Fajun

* Re: Process Scheduling Issue using sg/libata
  2007-11-17 17:13     ` James Chapman
@ 2007-11-17 19:37       ` Fajun Chen
  0 siblings, 0 replies; 19+ messages in thread
From: Fajun Chen @ 2007-11-17 19:37 UTC (permalink / raw)
  To: James Chapman; +Cc: Tejun Heo, linux-ide@vger.kernel.org, linux-scsi

On 11/17/07, James Chapman <jchapman@katalix.com> wrote:
> Fajun Chen wrote:
> > On 11/16/07, Tejun Heo <htejun@gmail.com> wrote:
> >> Fajun Chen wrote:
> >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2
> >>> and libata version 2.00 are loaded on ARM XScale board.  Under heavy
> >>> cpu load (e.g. when blocks per transfer/sector count is set to 1),
> >>> I've observed that the test application can suck cpu away for long
> >>> time (more than 20 seconds) and other processes including high
> >>> priority shell can not get the time slice to run.  What's interesting
> >>> is that if the application is under heavy IO load (e.g. when blocks
> >>> per transfer/sector count is set to 256),  the problem goes away. I
> >>> also tested with open source code sg_utils and got the same result, so
> >>> this is not a problem specific to my user-space application.
> >>>
> >>> Since user preemption is checked when the kernel is about to return to
> >>> user-space from a system call,  process scheduler should be invoked
> >>> after each system call. Something seems to be broken here.  I found a
> >>> similar issue below:
> >>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
> >>> But that turns out to be an issue with MTD/JFFS2 drivers, which are
> >>> not used in my system.
> >>>
> >>> Has anyone experienced similar issues with sg/libata? Any information
> >>> would be greatly appreciated.
> >> That's one weird story.  Does kernel say anything during that 20 seconds?
> >>
> > No. Nothing in kernel log.
> >
> > Fajun
>
> Have you considered using oprofile to find out what the CPU is doing
> during the 20 seconds?
>
Haven't tried oprofile yet; I'm not sure it will get a time slice to
run, though. During these 20 seconds, I've verified that my application
is still busy with R/W ops.

> Does the problem occur when you put it under load using another method?
> What are the ATA and network drivers here? I've seen some awful
> out-of-tree device drivers hog the CPU with busy-waits and other crap.
> Oprofile results should show the culprit.
If blocks per transfer/sector count is set to 256, which means the CPU
has less load (any other implications?), this problem no longer occurs.
Our target system uses the libata sil24/pata680 drivers and has a
customized FIFO driver but no network driver. The relevant variable
here is blocks per transfer/sector count, which seems to matter only to
sg/libata.

Thanks,
Fajun

* Re: Process Scheduling Issue using sg/libata
  2007-11-17 19:20       ` Fajun Chen
@ 2007-11-17 19:55         ` Mark Lord
  2007-11-18  6:48           ` Fajun Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Lord @ 2007-11-17 19:55 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Fajun Chen wrote:
> On 11/17/07, Mark Lord <liml@rtr.ca> wrote:
>> Fajun Chen wrote:
>>> On 11/16/07, Mark Lord <liml@rtr.ca> wrote:
>>>> Fajun Chen wrote:
..
>>> This problem also happens with R/W DMA ops. Below are simplified code snippets:
>>>     // Open one sg device for read
>>>       if ((sg_fd  = open(dev_name, O_RDWR))<0)
>>>       {
>>>           ...
>>>       }
>>>       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>>>                              MAP_SHARED, sg_fd, 0);
>>>
>>>     // Open the same sg device for write
>>>       if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
>>>       {
>>>          ...
>>>       }
>>>       write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>>>                              MAP_SHARED, sg_fd_wr, 0);
>> ..
>>
>> Mmmm.. what is the purpose of those two mmap'd areas ?
>> I think this is important and relevant here:  what are they used for?
>>
>> As coded above, these are memory mapped areas that (1) overlap,
>> and (2) will be demand paged automatically to/from the disk
>> as they are accessed/modified.  This *will* conflict with any SG_IO
>> operations happening at the same time on the same device.
..
> The purpose of using two memory mapped areas is to meet our
> requirement that certain data patterns for writing need to be kept
> across commands. For instance, if one buffer is used for both reads
> and writes, then this buffer will need to be re-populated with certain
> write data after each read command, which would be very costly for
> write-read mixed type of ops. This separate R/W buffer setting also
> facilitates data comparison.
> 
> These buffers are not used at the same time (one will be used only
> after the command on the other is completed). My application is the
> only program accessing disk using sg/libata and the rest of the
> programs run from ramdisk. Also, each buffer is only about 0.5MB and
> we have 64MB RAM on the target board.
> With this setup,  these two buffers should be pretty much independent
> and free from block layer/file system, correct?
..

No.  Those "buffers" as coded above are actually mmap'ed representations
of portions of the device (disk drive).  So any write into one of those
buffers will trigger disk writes, and just accessing ("read") the buffers
may trigger disk reads.

So what could be happening here is that when you trigger manual disk
accesses via SG_IO that result in data being copied into those "buffers",
the kernel then automatically schedules disk writes to update the on-disk
copies of those mmap'd regions.

What you probably intended to do instead was to use mmap to just allocate
some page-aligned RAM, not to actually mmap any on-disk data.  Right?

Here's how that's done:

       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                              MAP_SHARED|MAP_ANONYMOUS, -1, 0);
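
A slightly fuller sketch of the same allocation, with error checking
and cleanup. The helper names `alloc_anon_buffer` and `free_anon_buffer`
are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* exposes MAP_ANONYMOUS under strict -std= modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate len bytes of zero-filled, page-aligned RAM that is not
 * backed by any file or device. Returns NULL on failure.
 */
static void *alloc_anon_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Release a buffer obtained from alloc_anon_buffer(). */
static void free_anon_buffer(void *p, size_t len)
{
    if (p != NULL)
        munmap(p, len);
}
```

A buffer allocated this way is ordinary process memory. As far as the
sg driver's documentation describes it, `SG_FLAG_MMAP_IO` refers to a
mapping obtained from the sg file descriptor, so with an anonymous
buffer the address would presumably need to be passed in
`io_hdr.dxferp` instead.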

Cheers

* Re: Process Scheduling Issue using sg/libata
  2007-11-17 19:55         ` Mark Lord
@ 2007-11-18  6:48           ` Fajun Chen
  2007-11-18 14:32             ` Mark Lord
  0 siblings, 1 reply; 19+ messages in thread
From: Fajun Chen @ 2007-11-18  6:48 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

On 11/17/07, Mark Lord <liml@rtr.ca> wrote:
> Fajun Chen wrote:
> > On 11/17/07, Mark Lord <liml@rtr.ca> wrote:
> >> Fajun Chen wrote:
> >>> On 11/16/07, Mark Lord <liml@rtr.ca> wrote:
> >>>> Fajun Chen wrote:
> ..
> >>> This problem also happens with R/W DMA ops. Below are simplified code snippets:
> >>>     // Open one sg device for read
> >>>       if ((sg_fd  = open(dev_name, O_RDWR))<0)
> >>>       {
> >>>           ...
> >>>       }
> >>>       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>>                              MAP_SHARED, sg_fd, 0);
> >>>
> >>>     // Open the same sg device for write
> >>>       if ((sg_fd_wr = open(dev_name, O_RDWR))<0)
> >>>       {
> >>>          ...
> >>>       }
> >>>       write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>>                              MAP_SHARED, sg_fd_wr, 0);
> >> ..
> >>
> >> Mmmm.. what is the purpose of those two mmap'd areas ?
> >> I think this is important and relevant here:  what are they used for?
> >>
> >> As coded above, these are memory mapped areas that (1) overlap,
> >> and (2) will be demand paged automatically to/from the disk
> >> as they are accessed/modified.  This *will* conflict with any SG_IO
> >> operations happening at the same time on the same device.
> ..
> > The purpose of using two memory mapped areas is to meet our
> > requirement that certain data patterns for writing need to be kept
> > across commands. For instance, if one buffer is used for both reads
> > and writes, then this buffer will need to be re-populated with certain
> > write data after each read command, which would be very costly for
> > write-read mixed type of ops. This separate R/W buffer setting also
> > facilitates data comparison.
> >
> > These buffers are not used at the same time (one will be used only
> > after the command on the other is completed). My application is the
> > only program accessing disk using sg/libata and the rest of the
> > programs run from ramdisk. Also, each buffer is only about 0.5MB and
> > we have 64MB RAM on the target board.
> > With this setup,  these two buffers should be pretty much independent
> > and free from block layer/file system, correct?
> ..
>
> No.  Those "buffers" as coded above are actually mmap'ed representations
> of portions of the device (disk drive).  So any write into one of those
> buffers will trigger disk writes, and just accessing ("read") the buffers
> may trigger disk reads.
>
> So what could be happening here, is when you trigger manual disk accesses
> via SG_IO, that result in data being copied into those "buffers", the kernel
> then automatically schedules disk writes to update the on-disk copies of
> those mmap'd regions.
>
> What you probably intended to do instead, was to use mmap to just allocate
> some page-aligned RAM, not to actually mmap any on-disk data.  Right?
>
> Here's how that's done:
>
>       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>                              MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>
What I intended to do is to write data into disc or read data from
disc via SG_IO as requested by my user-space application. I don't want
any automatically scheduled kernel task to sync data with disc.

I've experimented with memory mapping using MAP_ANONYMOUS as you
suggested. The good news is that it does free up the CPU and my
system is much more responsive with the change. The bad news is that
the data read back from disc (PIO or DMA read) seems to be invisible
to the user-space application. For instance, the read buffer is all
zeros after an Identify Device command. Is this an expected side effect
of the MAP_ANONYMOUS option?

Thanks,
Fajun

* Re: Process Scheduling Issue using sg/libata
  2007-11-18  6:48           ` Fajun Chen
@ 2007-11-18 14:32             ` Mark Lord
  2007-11-18 19:14               ` Fajun Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Lord @ 2007-11-18 14:32 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Fajun Chen wrote:
> On 11/17/07, Mark Lord <liml@rtr.ca> wrote:
..
>> What you probably intended to do instead, was to use mmap to just allocate
>> some page-aligned RAM, not to actually mmap any on-disk data.  Right?
>>
>> Here's how that's done:
>>
>>       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
>>                              MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>>
> What I intended to do is to write data into disc or read data from
> disc via SG_IO as requested by my user-space application. I don't want
> any automatically scheduled kernel task to sync data with disc.
..

Right.  Then you definitely do NOT want to mmap your device,
because that's exactly what would otherwise happen, by design!


> I've experimented with memory mapping using MAP_ANONYMOUS as you
> suggested, the good news is that it does free up the cpu load and my
> system is much more responsive with the change.
..

Yes, that's what we expected to see.


> The bad news is that
> the data read back from disc (PIO or DMA read) seems to be invisible
> to user-space application. For instance, read buffer is all zeros
> after Identify Device command. Is this expected side effect of
> MAP_ANONYMOUS option?
..

No, that would be a side effect of some other bug in the code.

Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE
commands, using a mmap() buffer to receive the data.

Cheers

[-- Attachment #2: sg_identify.c --]
[-- Type: text/x-csrc, Size: 3678 bytes --]

/*
 * This code is copyright 2007 by Mark Lord,
 * and is made available to all under the terms
 * of the GNU General Public License v2.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>

#include <linux/fs.h>
#include <linux/hdreg.h>
#include <scsi/scsi.h>
#include <scsi/sg.h>
#include <sys/mman.h>

typedef unsigned long long u64;

enum {
	ATA_CMD_PIO_IDENTIFY		= 0xec,
	ATA_CMD_PIO_PIDENTIFY		= 0xa1,

	/* normal sector size (bytes) for PIO/DMA */
	ATA_SECT_SIZE			= 512,

	ATA_16				= 0x85,
	ATA_16_LEN			= 16,

	ATA_DEV_REG_LBA			= (1 << 6),

	ATA_LBA48			= 1,

	/* data transfer protocols; only basic PIO and DMA actually work */
	ATA_PROTO_NON_DATA		= ( 3 << 1),
	ATA_PROTO_PIO_IN		= ( 4 << 1),
	ATA_PROTO_PIO_OUT		= ( 5 << 1),
	ATA_PROTO_DMA			= ( 6 << 1),
	ATA_PROTO_UDMA_IN		= (11 << 1), /* unsupported */
	ATA_PROTO_UDMA_OUT		= (12 << 1), /* unsupported */
};

/*
 * Taskfile layout for ATA_16 cdb (LBA28/LBA48):
 *
 *	cdb[ 4] = feature
 *	cdb[ 6] = nsect
 *	cdb[ 8] = lbal
 *	cdb[10] = lbam
 *	cdb[12] = lbah
 *	cdb[13] = device
 *	cdb[14] = command
 *
 * "high order byte" (hob) fields for LBA48 commands:
 *
 *	cdb[ 3] = hob_feature
 *	cdb[ 5] = hob_nsect
 *	cdb[ 7] = hob_lbal
 *	cdb[ 9] = hob_lbam
 *	cdb[11] = hob_lbah
 *
 * dxfer_direction choices:
 *
 *	SG_DXFER_TO_DEV		(writing to drive)
 *	SG_DXFER_FROM_DEV	(reading from drive)
 *	SG_DXFER_NONE		(non-data commands)
 */

static int sg_issue (int fd, unsigned char ata_op, void *buf)
{
	unsigned char cdb[ATA_16_LEN]
		= { ATA_16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
	unsigned char sense[32];
	unsigned int nsects = 1;
	struct sg_io_hdr hdr;

	cdb[ 1] = ATA_PROTO_PIO_IN;
	cdb[ 6] = nsects;
	cdb[14] = ata_op;

	memset(&hdr, 0, sizeof(struct sg_io_hdr));
	hdr.interface_id	= 'S';
	hdr.cmd_len		= ATA_16_LEN;
	hdr.mx_sb_len		= sizeof(sense);
	hdr.dxfer_direction	= SG_DXFER_FROM_DEV;
	hdr.dxfer_len		= nsects * ATA_SECT_SIZE;
	hdr.dxferp		= buf;
	hdr.cmdp		= cdb;
	hdr.sbp			= sense;
	hdr.timeout		= 5000; /* milliseconds */

	memset(sense, 0, sizeof(sense));
	if (ioctl(fd, SG_IO, &hdr) < 0) {
		perror("ioctl(SG_IO)");
		return (-1);
	}
	if (hdr.status == 0 && hdr.host_status == 0 && hdr.driver_status == 0)
		return 0; /* success */

	if (hdr.status > 0) {
		unsigned char *d = sense + 8;
		/* SCSI status is non-zero */
		fprintf(stderr, "SG_IO error: SCSI sense=0x%x/%02x/%02x, ATA=0x%02x/%02x\n",
			sense[1] & 0xf, sense[2], sense[3], d[13], d[3]);
		return -1;
	}
	/* some other error we don't know about yet */
	fprintf(stderr, "SG_IO returned: SCSI status=0x%x, host_status=0x%x, driver_status=0x%x\n",
		hdr.status, hdr.host_status, hdr.driver_status);
	return -1;
}

int main (int argc, char *argv[])
{
	const char *devpath;
	int i, rc, fd;
#if 0
	unsigned short id[ATA_SECT_SIZE / 2];
	memset(id, 0, sizeof(id));
#else
	unsigned short *id;
	id = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0);
	if (id == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
#endif
	if (argc != 2) {
		fprintf(stderr, "%s: bad/missing parm: expected <devpath>\n", argv[0]);
		exit(1);
	}
	devpath = argv[1];

	fd = open(devpath, O_RDWR|O_NONBLOCK);
	if (fd == -1) {
		perror(devpath);
		exit(1);
	}
	rc = sg_issue(fd, ATA_CMD_PIO_IDENTIFY, id);
	if (rc != 0)
		rc = sg_issue(fd, ATA_CMD_PIO_PIDENTIFY, id);
	if (rc == 0) {
		unsigned short *d = id;
		for (i = 0; i < (256/8); ++i) {
			printf("%04x %04x %04x %04x %04x %04x %04x %04x\n",
				d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]);
			d += 8;
		}
		exit(0);
	}
	exit(1);
}


* Re: Process Scheduling Issue using sg/libata
  2007-11-18 14:32             ` Mark Lord
@ 2007-11-18 19:14               ` Fajun Chen
  2007-11-18 19:54                 ` Mark Lord
  0 siblings, 1 reply; 19+ messages in thread
From: Fajun Chen @ 2007-11-18 19:14 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 2312 bytes --]

On 11/18/07, Mark Lord <liml@rtr.ca> wrote:
> Fajun Chen wrote:
> > On 11/17/07, Mark Lord <liml@rtr.ca> wrote:
> ..
> >> What you probably intended to do instead, was to use mmap to just allocate
> >> some page-aligned RAM, not to actually mmap'd any on-disk data.  Right?
> >>
> >> Here's how that's done:
> >>
> >>       read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> >>                              MAP_SHARED|MAP_ANONYMOUS, -1, 0);
> >>
> > What I intended to do is to write data into disc or read data from
> > disc via SG_IO as requested by my user-space application. I don't want
> > any automatically scheduled kernel task to sync data with disc.
> ..
>
> Right.  Then you definitely do NOT want to mmap your device,
> because that's exactly what would otherwise happen, by design!
>
>
> > I've experimented with memory mapping using MAP_ANONYMOUS as you
> > suggested, the good news is that it does free up the cpu load and my
> > system is much more responsive with the change.
> ..
>
> Yes, that's what we expected to see.
>
>
> > The bad news is that
> > the data read back from disc (PIO or DMA read) seems to be invisible
> > to user-space application. For instance, read buffer is all zeros
> > after Identify Device command. Is this expected side effect of
> > MAP_ANONYMOUS option?
> ..
>
> No, that would be a side effect of some other bug in the code.
>
> Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE
> commands, using a mmap() buffer to receive the data.
>

I verified that your program works on my system, and my application works as
well if changed accordingly. However, this change (indirect IO in sg
terms) may come at a performance cost for IO-intensive applications,
since it does NOT use the mmap-ed buffer managed by the sg driver.  Please
see the relevant sg documentation below:
http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
As an example, sg_rbuf.c in the sg3_utils package uses the SG_FLAG_MMAP_IO
flag in SG_IO. Please see the attached source code. I also noticed that
MAP_ANONYMOUS is NOT used in the mmap() call in sg_rbuf.c, which may not
be desirable, as you pointed out in previous emails. So this brings up
an interesting sg usage question: can we use MAP_ANONYMOUS with the
SG_FLAG_MMAP_IO flag in SG_IO?

Thanks,
Fajun

[-- Attachment #2: sg_rbuf.c --]
[-- Type: application/octet-stream, Size: 12049 bytes --]

#define _XOPEN_SOURCE 500
#define _GNU_SOURCE  

#include <unistd.h>
#include <signal.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/fs.h>
#include "sg_include.h"
#include "sg_lib.h"

/* Test code for D. Gilbert's extensions to the Linux OS SCSI generic ("sg")
   device driver.
*  Copyright (C) 1999-2004 D. Gilbert
*  This program is free software; you can redistribute it and/or modify
*  it under the terms of the GNU General Public License as published by
*  the Free Software Foundation; either version 2, or (at your option)
*  any later version.

   This program uses the SCSI command READ BUFFER on the given sg
   device, first to find out how big it is and then to read that
   buffer. The '-q' option skips the data transfer from the kernel
   DMA buffers to the user space. The '-b=num' option allows the
   buffer size (in KiB) to be specified (default is to use the
   number obtained from READ BUFFER (descriptor) SCSI command).
   The '-s=num' option allows the total size of the transfer to be
   set (in megabytes, the default is 200 MiB). The '-d' option requests
   direct io (and is overridden by '-q').
   The '-m' option requests mmap-ed IO (and overrides the '-q' and '-d'
   options if they are also given).
   The ability to time transfers internally (based on gettimeofday()) has
   been added with the '-t' option.
*/


#define RB_MODE_DESC 3
#define RB_MODE_DATA 2
#define RB_DESC_LEN 4
#define RB_MIB_TO_READ 200
#define RB_OPCODE 0x3C
#define RB_CMD_LEN 10

/* #define SG_DEBUG */

#ifndef SG_FLAG_MMAP_IO
#define SG_FLAG_MMAP_IO 4
#endif

#define ME "sg_rbuf: "

static char * version_str = "4.77 20041011";

static void usage()
{
    printf("Usage: sg_rbuf [-b=num] [[-q] | [-d] | [-m]] [-s=num] [-t] "
           "[-v] [-V]\n               <generic_device>\n");
    printf("  where  -b=num   num is buffer size to use (in KiB)\n");
    printf("         -d       requests dio ('-q' overrides it)\n");
    printf("         -m       requests mmap-ed IO (overrides -q, -d)\n");
    printf("         -q       quick, don't xfer to user space\n");
    printf("         -s=num   num is total size to read (in MiB)\n");
    printf("                    default total size is 200 MiB\n");
    printf("                    max total size is 4000 MiB\n");
    printf("         -t       time the data transfer\n");
    printf("         -v       increase verbosity (more debug)\n");
    printf("         -V       print version string then exit\n");
}

int main(int argc, char * argv[])
{
    int sg_fd, res, j, m;
    unsigned int k, num;
    unsigned char rbCmdBlk [RB_CMD_LEN];
    unsigned char * rbBuff = NULL;
    void * rawp = NULL;
    unsigned char sense_buffer[32];
    int buf_capacity = 0;
    int do_quick = 0;
    int do_dio = 0;
    int do_mmap = 0;
    int do_time = 0;
    int verbose = 0;
    int buf_size = 0;
    unsigned int total_size_mib = RB_MIB_TO_READ;
    char * file_name = 0;
    size_t psz = getpagesize();
    int dio_incomplete = 0;
    struct sg_io_hdr io_hdr;
    struct timeval start_tm, end_tm;
#ifdef SG_DEBUG
    int clear = 1;
#endif

    for (j = 1; j < argc; ++j) {
        if (0 == strncmp("-b=", argv[j], 3)) {
            m = 3;
            num = sscanf(argv[j] + m, "%d", &buf_size);
            if ((1 != num) || (buf_size <= 0)) {
                printf("Couldn't decode number after '-b' switch\n");
                file_name = 0;
                break;
            }
            buf_size *= 1024;
        }
        else if (0 == strncmp("-s=", argv[j], 3)) {
            m = 3;
            num = sscanf(argv[j] + m, "%u", &total_size_mib);
            if (1 != num) {
                printf("Couldn't decode number after '-s' switch\n");
                file_name = 0;
                break;
            }
        }
        else if (0 == strcmp("-q", argv[j]))
            do_quick = 1;
        else if (0 == strcmp("-d", argv[j]))
            do_dio = 1;
        else if (0 == strcmp("-m", argv[j]))
            do_mmap = 1;
        else if (0 == strcmp("-t", argv[j]))
            do_time = 1;
        else if (0 == strcmp("-v", argv[j]))
            ++verbose;
        else if (0 == strcmp("-V", argv[j])) {
            fprintf(stderr, ME "version: %s\n", version_str);
            return 0;
        } else if (*argv[j] == '-') {
            printf("Unrecognized switch: %s\n", argv[j]);
            file_name = 0;
            break;
        }
        else
            file_name = argv[j];
    }
    if (0 == file_name) {
        usage();
        return 1;
    }

    sg_fd = open(file_name, O_RDONLY);
    if (sg_fd < 0) {
        perror(ME "open error");
        return 1;
    }
    /* Don't worry, being very careful not to write to a non-sg file ... */
    if (do_mmap) {
        do_dio = 0;
        do_quick = 0;
    }
    if (NULL == (rawp = malloc(512))) {
        printf(ME "out of memory (query)\n");
        return 1;
    }
    rbBuff = rawp;

    memset(rbCmdBlk, 0, RB_CMD_LEN);
    rbCmdBlk[0] = RB_OPCODE;
    rbCmdBlk[1] = RB_MODE_DESC;
    rbCmdBlk[8] = RB_DESC_LEN;
    memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof(rbCmdBlk);
    io_hdr.mx_sb_len = sizeof(sense_buffer);
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = RB_DESC_LEN;
    io_hdr.dxferp = rbBuff;
    io_hdr.cmdp = rbCmdBlk;
    io_hdr.sbp = sense_buffer;
    io_hdr.timeout = 60000;     /* 60000 millisecs == 60 seconds */
    /* do normal IO to find RB size (not dio or mmap-ed at this stage) */

    if (ioctl(sg_fd, SG_IO, &io_hdr) < 0) {
        perror(ME "SG_IO READ BUFFER descriptor error");
        if (rawp) free(rawp);
        return 1;
    }

    /* now for the error processing */
    switch (sg_err_category3(&io_hdr)) {
    case SG_LIB_CAT_CLEAN:
        break;
    case SG_LIB_CAT_RECOVERED:
        printf("Recovered error on READ BUFFER descriptor, continuing\n");
        break;
    default: /* won't bother decoding other categories */
        sg_chk_n_print3("READ BUFFER descriptor error", &io_hdr);
        if (rawp) free(rawp);
        return 1;
    }

    buf_capacity = ((rbBuff[1] << 16) | (rbBuff[2] << 8) | rbBuff[3]);
    printf("READ BUFFER reports: buffer capacity=%d, offset boundary=%d\n",
           buf_capacity, (int)rbBuff[0]);

    if (0 == buf_size)
        buf_size = buf_capacity;
    else if (buf_size > buf_capacity) {
        printf("Requested buffer size=%d exceeds reported capacity=%d\n",
               buf_size, buf_capacity);
        if (rawp) free(rawp);
        return 1;
    }
    if (rawp) {
        free(rawp);
        rawp = NULL;
    }

    if (! do_dio) {
        k = buf_size;
        if (do_mmap && (0 != (k % psz)))
            k = ((k / psz) + 1) * psz;  /* round up to page size */
        res = ioctl(sg_fd, SG_SET_RESERVED_SIZE, &k);
        if (res < 0)
            perror(ME "SG_SET_RESERVED_SIZE error");
    }

    if (do_mmap) {
        rbBuff = mmap(NULL, buf_size, PROT_READ, MAP_SHARED, sg_fd, 0);
        if (MAP_FAILED == rbBuff) {
            if (ENOMEM == errno)
                printf(ME "mmap() out of memory, try a smaller "
                       "buffer size than %d KiB\n", buf_size / 1024);
            else
                perror(ME "error using mmap()");
            return 1;
        }
    }
    else { /* non mmap-ed IO */
        rawp = malloc(buf_size + (do_dio ? psz : 0));
        if (NULL == rawp) {
            printf(ME "out of memory (data)\n");
            return 1;
        }
        if (do_dio)    /* align to page boundary */
            rbBuff= (unsigned char *)(((unsigned long)rawp + psz - 1) &
                                      (~(psz - 1)));
        else
            rbBuff = rawp;
    }

    num = (total_size_mib * 1024U * 1024U) / (unsigned int)buf_size;
    if (do_time) {
        start_tm.tv_sec = 0;
        start_tm.tv_usec = 0;
        gettimeofday(&start_tm, NULL);
    }
    /* main data reading loop */
    for (k = 0; k < num; ++k) {
        memset(rbCmdBlk, 0, RB_CMD_LEN);
        rbCmdBlk[0] = RB_OPCODE;
        rbCmdBlk[1] = RB_MODE_DATA;
        rbCmdBlk[6] = 0xff & (buf_size >> 16);
        rbCmdBlk[7] = 0xff & (buf_size >> 8);
        rbCmdBlk[8] = 0xff & buf_size;
#ifdef SG_DEBUG
        memset(rbBuff, 0, buf_size);
#endif

        memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
        io_hdr.interface_id = 'S';
        io_hdr.cmd_len = sizeof(rbCmdBlk);
        io_hdr.mx_sb_len = sizeof(sense_buffer);
        io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        io_hdr.dxfer_len = buf_size;
        if (! do_mmap)
            io_hdr.dxferp = rbBuff;
        io_hdr.cmdp = rbCmdBlk;
        io_hdr.sbp = sense_buffer;
        io_hdr.timeout = 20000;     /* 20000 millisecs == 20 seconds */
        io_hdr.pack_id = k;
        if (do_mmap)
            io_hdr.flags |= SG_FLAG_MMAP_IO;
        else if (do_dio)
            io_hdr.flags |= SG_FLAG_DIRECT_IO;
        else if (do_quick)
            io_hdr.flags |= SG_FLAG_NO_DXFER;

        if (ioctl(sg_fd, SG_IO, &io_hdr) < 0) {
            if (ENOMEM == errno)
                printf(ME "SG_IO data; out of memory, try a smaller "
                       "buffer size than %d KiB\n", buf_size / 1024);
            else
                perror(ME "SG_IO READ BUFFER data error");
            if (rawp) free(rawp);
            return 1;
        }

        /* now for the error processing */
        switch (sg_err_category3(&io_hdr)) {
        case SG_LIB_CAT_CLEAN:
            break;
        case SG_LIB_CAT_RECOVERED:
            printf("Recovered error on READ BUFFER data, continuing\n");
            break;
        default: /* won't bother decoding other categories */
            sg_chk_n_print3("READ BUFFER data error", &io_hdr);
            if (rawp) free(rawp);
            return 1;
        }
        if (do_dio &&  
            ((io_hdr.info & SG_INFO_DIRECT_IO_MASK) != SG_INFO_DIRECT_IO))
            dio_incomplete = 1;    /* flag that dio not done (completely) */
        
#ifdef SG_DEBUG
        if (clear) {
            for (j = 0; j < buf_size; ++j) {
                if (rbBuff[j] != 0) {
                    clear = 0;
                    break;
                }
            }
        }
#endif
    }
    if ((do_time) && (start_tm.tv_sec || start_tm.tv_usec)) {
        struct timeval res_tm;
        double a, b;

        gettimeofday(&end_tm, NULL);
        res_tm.tv_sec = end_tm.tv_sec - start_tm.tv_sec;
        res_tm.tv_usec = end_tm.tv_usec - start_tm.tv_usec;
        if (res_tm.tv_usec < 0) {
            --res_tm.tv_sec;
            res_tm.tv_usec += 1000000;
        }
        a = res_tm.tv_sec;
        a += (0.000001 * res_tm.tv_usec);
        b = (double)buf_size * num;
        printf("time to read data from buffer was %d.%06d secs", 
               (int)res_tm.tv_sec, (int)res_tm.tv_usec);
        if ((a > 0.00001) && (b > 511))
            printf(", %.2f MB/sec\n", b / (a * 1000000.0));
        else
            printf("\n");
    }
    if (dio_incomplete)
        printf(">> direct IO requested but not done\n");
    printf("Read %u MiB (actual %u MiB, %u bytes), buffer size=%d KiB\n",
           total_size_mib, (num * buf_size) / 1048576, num * buf_size,
           buf_size / 1024);

    if (rawp) free(rawp);
    res = close(sg_fd);
    if (res < 0) {
        perror(ME "close error");
        return 1;
    }
#ifdef SG_DEBUG
    if (clear)
        printf("read buffer always zero\n");
    else
        printf("read buffer non-zero\n");
#endif
    return 0;
}


* Re: Process Scheduling Issue using sg/libata
  2007-11-18 19:14               ` Fajun Chen
@ 2007-11-18 19:54                 ` Mark Lord
  2007-11-18 22:29                   ` Fajun Chen
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Lord @ 2007-11-18 19:54 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Fajun Chen wrote:
>..
> I verified your program works in my system and my application works as
> well if changed accordingly. However, this change (indirect IO in sg
> term) may come at a performance cost for IO intensive applications
> since it does NOT utilize mmaped buffer managed by sg driver.  Please
> see relevant sg document below:
> http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
> http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
> As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag
> in SG_IO. Please see source code attached. I also noticed that
> MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not
> be desirable as you pointed out in previous emails. So this brings up
> an interesting sg usage issue: can we use MAP_ANONYMOUS with
> SG_FLAG_MMAP_IO flag in SG_IO?
..

The SG_FLAG_MMAP_IO flag works only with /dev/sg* devices, not /dev/sd* devices.
I don't know which kind you were trying to use, since you still have
not provided your source code for examination.

If you are using /dev/sg*, then you should be able to get your original mmap()
code to work.  But the behaviour described thus far seems to indicate that
your secret program must have been using /dev/sd* instead.

Cheers


* Re: Process Scheduling Issue using sg/libata
  2007-11-18 19:54                 ` Mark Lord
@ 2007-11-18 22:29                   ` Fajun Chen
  2007-11-18 23:07                     ` Mark Lord
  0 siblings, 1 reply; 19+ messages in thread
From: Fajun Chen @ 2007-11-18 22:29 UTC (permalink / raw)
  To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

[-- Attachment #1: Type: text/plain, Size: 2164 bytes --]

On 11/18/07, Mark Lord <liml@rtr.ca> wrote:
> Fajun Chen wrote:
> >..
> > I verified your program works in my system and my application works as
> > well if changed accordingly. However, this change (indirect IO in sg
> > term) may come at a performance cost for IO intensive applications
> > since it does NOT utilize mmaped buffer managed by sg driver.  Please
> > see relevant sg document below:
> > http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330
> > http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio
> > As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag
> > in SG_IO. Please see source code attached. I also noticed that
> > MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not
> > be desirable as you pointed out in previous emails. So this brings up
> > an interesting sg usage issue: can we use MAP_ANONYMOUS with
> > SG_FLAG_MMAP_IO flag in SG_IO?
> ..
>
> The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd* devices.
> I don't know which kind you were trying to use, since you still have
> not provided your source code for examination.
>
> If you are using /dev/sg*, then you should be able to get your original mmap()
> code to work.  But the behaviour described thus far seems to indicate that
> your secret program must have been using /dev/sd* instead.
>
As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
application, I have not been able to reduce it to a small and
publishable form. However, this issue can be easily reproduced on my
ARM XScale target using sg3_utils code as follows:
1. Run printtime.c (attached), which prints a message to the console in a loop.
2. Run sgm_dd (part of the sg3_utils package, source code attached) on the
same system as follows:
>sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
The print task can be delayed for as many as 25 seconds. Surprisingly,
I can't reproduce the problem in an i386 test system with a more
powerful processor.

A clarification on the MAP_ANONYMOUS option in mmap(): after fixing a
bug and testing further, this option seems to make no difference to CPU
load. Sorry about the previous report. Back to the drawing board now :-)

Thanks,
Fajun

[-- Attachment #2: printtime.c --]
[-- Type: application/octet-stream, Size: 551 bytes --]

#include <stdio.h>
#include <time.h>
#include <string.h>

void TimeDateStamp(char *current_time)
{
  time_t current_time_seconds; // current time in seconds
  int size;

  time(&current_time_seconds); // get the current time
  sprintf(current_time,"%s",ctime(&current_time_seconds)); // convert to time-date stamp
  size = strlen(current_time);
  current_time[size-1] = '\0';
}


int main()
{
  char ts[80];

  int i = 0;
  while (++i)
  {
    TimeDateStamp(ts);
    printf("[%s]In loop %d\n", ts, i);
  }
  return 0;
}


[-- Attachment #3: sgm_dd.c --]
[-- Type: application/octet-stream, Size: 35309 bytes --]

#define _XOPEN_SOURCE 500
#define _GNU_SOURCE

#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <linux/major.h> 
#include <linux/fs.h> 
#include "sg_include.h"
#include "sg_lib.h"
#include "sg_cmds.h"
#include "llseek.h"

/* A utility program for copying files. Specialised for "files" that
*  represent devices that understand the SCSI command set.
*
*  Copyright (C) 1999 - 2004 D. Gilbert and P. Allworth
*  This program is free software; you can redistribute it and/or modify
*  it under the terms of the GNU General Public License as published by
*  the Free Software Foundation; either version 2, or (at your option)
*  any later version.

   This program is a specialisation of the Unix "dd" command in which
   either the input or the output file is a scsi generic device or a
   raw device. The block size ('bs') is assumed to be 512 if not given. 
   This program complains if 'ibs' or 'obs' are given with a value
   that differs from 'bs' (or the default 512).
   If 'if' is not given or 'if=-' then stdin is assumed. If 'of' is
   not given or 'of=-' then stdout assumed. Multipliers:
     'c','C'  *1       'b','B' *512      'k' *1024      'K' *1000
     'm' *(1024^2)     'M' *(1000^2)     'g' *(1024^3)  'G' *(1000^3)
     't' *(1024^4)     'T' *(1000^4)

   A non-standard argument "bpt" (blocks per transfer) is added to control
   the maximum number of blocks in each transfer. The default value is 128.
   For example if "bs=512" and "bpt=32" then a maximum of 32 blocks (16 KiB
   in this case) is transferred to or from the sg device in a single SCSI
   command.

   This version uses memory-mapped IO (i.e. mmap() call from the user
   space) to speed transfers. If both sides of copy are sg devices
   then only the read side will be mmap-ed, while the write side will
   use normal IO.

   This version is designed for the linux kernel 2.4 and 2.6 series.
*/

static char * version_str = "1.16 20041102";

#define DEF_BLOCK_SIZE 512
#define DEF_BLOCKS_PER_TRANSFER 128
#define DEF_SCSI_CDBSZ 10
#define MAX_SCSI_CDBSZ 16

#define ME "sgm_dd: "

/* #define SG_DEBUG */

#ifndef SG_FLAG_MMAP_IO
#define SG_FLAG_MMAP_IO 4
#endif

#define SENSE_BUFF_LEN 32       /* Arbitrary, could be larger */
#define READ_CAP_REPLY_LEN 8
#define RCAP16_REPLY_LEN 32

#ifndef SERVICE_ACTION_IN
#define SERVICE_ACTION_IN     0x9e
#endif
#ifndef SAI_READ_CAPACITY_16
#define SAI_READ_CAPACITY_16  0x10
#endif

#define DEF_TIMEOUT 60000       /* 60,000 millisecs == 60 seconds */

#ifndef RAW_MAJOR
#define RAW_MAJOR 255   /* unlikely value */
#endif 

#define FT_OTHER 1              /* filetype other than one of following */
#define FT_SG 2                 /* filetype is sg char device */
#define FT_RAW 4                /* filetype is raw char device */
#define FT_DEV_NULL 8           /* either "/dev/null" or "." as filename */
#define FT_ST 16                /* filetype is st char device (tape) */
#define FT_BLOCK 32             /* filetype is a block device */

#define DEV_NULL_MINOR_NUM 3

static int sum_of_resids = 0;

static long long dd_count = -1;
static long long in_full = 0;
static int in_partial = 0;
static long long out_full = 0;
static int out_partial = 0;

static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio";


static void install_handler (int sig_num, void (*sig_handler) (int sig))
{
    struct sigaction sigact;
    sigaction (sig_num, NULL, &sigact);
    if (sigact.sa_handler != SIG_IGN)
    {
        sigact.sa_handler = sig_handler;
        sigemptyset (&sigact.sa_mask);
        sigact.sa_flags = 0;
        sigaction (sig_num, &sigact, NULL);
    }
}

void print_stats()
{
    if (0 != dd_count)
        fprintf(stderr, "  remaining block count=%lld\n", dd_count);
    fprintf(stderr, "%lld+%d records in\n", in_full - in_partial, in_partial);
    fprintf(stderr, "%lld+%d records out\n", out_full - out_partial, 
            out_partial);
}

static void interrupt_handler(int sig)
{
    struct sigaction sigact;

    sigact.sa_handler = SIG_DFL;
    sigemptyset (&sigact.sa_mask);
    sigact.sa_flags = 0;
    sigaction (sig, &sigact, NULL);
    fprintf(stderr, "Interrupted by signal,");
    print_stats ();
    kill (getpid (), sig);
}

static void siginfo_handler(int sig)
{
    sig = sig;  /* dummy to stop -W warning messages */
    fprintf(stderr, "Progress report, continuing ...\n");
    print_stats ();
}

int dd_filetype(const char * filename)
{
    struct stat st;
    size_t len = strlen(filename);

    if ((1 == len) && ('.' == filename[0]))
        return FT_DEV_NULL;
    if (stat(filename, &st) < 0)
        return FT_OTHER;
    if (S_ISCHR(st.st_mode)) {
        if ((MEM_MAJOR == major(st.st_rdev)) &&
            (DEV_NULL_MINOR_NUM == minor(st.st_rdev)))
            return FT_DEV_NULL;
        if (RAW_MAJOR == major(st.st_rdev))
            return FT_RAW;
        if (SCSI_GENERIC_MAJOR == major(st.st_rdev))
            return FT_SG;
        if (SCSI_TAPE_MAJOR == major(st.st_rdev))
            return FT_ST;
    } else if (S_ISBLK(st.st_mode))
        return FT_BLOCK;
    return FT_OTHER;
}

void usage()
{
    fprintf(stderr, "Usage: "
           "sgm_dd  [if=<infile>] [skip=<n>] [of=<ofile>] [seek=<n>]\n"
           "               [bs=<num>] [bpt=<num>] [count=<n>] [time=<n>]\n"
           "               [cdbsz=6|10|12|16] [fua=0|1|2|3] [sync=0|1]\n"
           "               [dio=0|1] [--version]\n"
           " 'bs'  must be device block size (default 512)\n"
           " 'bpt' is blocks_per_transfer (default is 128)\n"
           " 'time' 0->no timing(def), 1->time plus calculate throughput\n"
           " 'fua' force unit access: 0->don't(def), 1->of, 2->if, 3->of+if\n"
           " 'sync' 0->no sync(def), 1->SYNCHRONIZE CACHE after xfer\n"
           " 'cdbsz' size of SCSI READ or WRITE command (default is 10)\n"
           " 'dio'  0->indirect IO on write, 1->direct IO on write\n"
           "        (only when read side is sg device (using mmap))\n");
}

/* Return of 0 -> success, -1 -> failure, 2 -> try again */
int scsi_read_capacity(int sg_fd, long long * num_sect, int * sect_sz)
{
    int k, res;
    unsigned char rcBuff[RCAP16_REPLY_LEN];

    res = sg_ll_readcap_10(sg_fd, 0, 0, rcBuff, READ_CAP_REPLY_LEN, 0);
    if (0 != res)
        return res;

    if ((0xff == rcBuff[0]) && (0xff == rcBuff[1]) && (0xff == rcBuff[2]) &&
        (0xff == rcBuff[3])) {
        long long ls;

        res = sg_ll_readcap_16(sg_fd, 0, 0, rcBuff, RCAP16_REPLY_LEN, 0);
        if (0 != res)
            return res;
        for (k = 0, ls = 0; k < 8; ++k) {
            ls <<= 8;
            ls |= rcBuff[k];
        }
        *num_sect = ls + 1;
        *sect_sz = (rcBuff[8] << 24) | (rcBuff[9] << 16) |
                   (rcBuff[10] << 8) | rcBuff[11];
    } else {
        *num_sect = 1 + ((rcBuff[0] << 24) | (rcBuff[1] << 16) |
                    (rcBuff[2] << 8) | rcBuff[3]);
        *sect_sz = (rcBuff[4] << 24) | (rcBuff[5] << 16) |
                   (rcBuff[6] << 8) | rcBuff[7];
    }
    return 0;
}

/* Return of 0 -> success, -1 -> failure */
int read_blkdev_capacity(int sg_fd, long long * num_sect, int * sect_sz)
{
#ifdef BLKGETSIZE64
    unsigned long long ull;

    if (ioctl(sg_fd, BLKGETSIZE64, &ull) < 0) {

        perror("BLKGETSIZE64 ioctl error");
        return -1;
    }
    if (ioctl(sg_fd, BLKSSZGET, sect_sz) < 0) {
        perror("BLKSSZGET ioctl error");
        return -1;
    }
    *num_sect = ((long long)ull / (long long)*sect_sz);
#else
    unsigned long ul;

    if (ioctl(sg_fd, BLKGETSIZE, &ul) < 0) {
        perror("BLKGETSIZE ioctl error");
        return -1;
    }
    *num_sect = (long long)ul;
    if (ioctl(sg_fd, BLKSSZGET, sect_sz) < 0) {
        perror("BLKSSZGET ioctl error");
        return -1;
    }
#endif
    return 0;
}


int sg_build_scsi_cdb(unsigned char * cdbp, int cdb_sz, unsigned int blocks,
                      long long start_block, int write_true, int fua,
                      int dpo)
{
    int rd_opcode[] = {0x8, 0x28, 0xa8, 0x88};
    int wr_opcode[] = {0xa, 0x2a, 0xaa, 0x8a};
    int sz_ind;

    memset(cdbp, 0, cdb_sz);
    if (dpo)
        cdbp[1] |= 0x10;
    if (fua)
        cdbp[1] |= 0x8;
    switch (cdb_sz) {
    case 6:
        sz_ind = 0;
        cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] :
                                               rd_opcode[sz_ind]);
        cdbp[1] = (unsigned char)((start_block >> 16) & 0x1f);
        cdbp[2] = (unsigned char)((start_block >> 8) & 0xff);
        cdbp[3] = (unsigned char)(start_block & 0xff);
        cdbp[4] = (256 == blocks) ? 0 : (unsigned char)blocks;
        if (blocks > 256) {
            fprintf(stderr, ME "for 6 byte commands, maximum number of "
                            "blocks is 256\n");
            return 1;
        }
        if ((start_block + blocks - 1) & (~0x1fffff)) {
            fprintf(stderr, ME "for 6 byte commands, can't address blocks"
                            " beyond %d\n", 0x1fffff);
            return 1;
        }
        if (dpo || fua) {
            fprintf(stderr, ME "for 6 byte commands, neither dpo nor fua"
                            " bits supported\n");
            return 1;
        }
        break;
    case 10:
        sz_ind = 1;
        cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] :
                                               rd_opcode[sz_ind]);
        cdbp[2] = (unsigned char)((start_block >> 24) & 0xff);
        cdbp[3] = (unsigned char)((start_block >> 16) & 0xff);
        cdbp[4] = (unsigned char)((start_block >> 8) & 0xff);
        cdbp[5] = (unsigned char)(start_block & 0xff);
        cdbp[7] = (unsigned char)((blocks >> 8) & 0xff);
        cdbp[8] = (unsigned char)(blocks & 0xff);
        if (blocks & (~0xffff)) {
            fprintf(stderr, ME "for 10 byte commands, maximum number of "
                            "blocks is %d\n", 0xffff);
            return 1;
        }
        break;
    case 12:
        sz_ind = 2;
        cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] :
                                               rd_opcode[sz_ind]);
        cdbp[2] = (unsigned char)((start_block >> 24) & 0xff);
        cdbp[3] = (unsigned char)((start_block >> 16) & 0xff);
        cdbp[4] = (unsigned char)((start_block >> 8) & 0xff);
        cdbp[5] = (unsigned char)(start_block & 0xff);
        cdbp[6] = (unsigned char)((blocks >> 24) & 0xff);
        cdbp[7] = (unsigned char)((blocks >> 16) & 0xff);
        cdbp[8] = (unsigned char)((blocks >> 8) & 0xff);
        cdbp[9] = (unsigned char)(blocks & 0xff);
        break;
    case 16:
        sz_ind = 3;
        cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] :
                                               rd_opcode[sz_ind]);
        cdbp[2] = (unsigned char)((start_block >> 56) & 0xff);
        cdbp[3] = (unsigned char)((start_block >> 48) & 0xff);
        cdbp[4] = (unsigned char)((start_block >> 40) & 0xff);
        cdbp[5] = (unsigned char)((start_block >> 32) & 0xff);
        cdbp[6] = (unsigned char)((start_block >> 24) & 0xff);
        cdbp[7] = (unsigned char)((start_block >> 16) & 0xff);
        cdbp[8] = (unsigned char)((start_block >> 8) & 0xff);
        cdbp[9] = (unsigned char)(start_block & 0xff);
        cdbp[10] = (unsigned char)((blocks >> 24) & 0xff);
        cdbp[11] = (unsigned char)((blocks >> 16) & 0xff);
        cdbp[12] = (unsigned char)((blocks >> 8) & 0xff);
        cdbp[13] = (unsigned char)(blocks & 0xff);
        break;
    default:
        fprintf(stderr, ME "expected cdb size of 6, 10, 12, or 16 but got"
                        "=%d\n", cdb_sz);
        return 1;
    }
    return 0;
}

/* -1 -> unrecoverable error, 0 -> successful, 1 -> recoverable (ENOMEM),
   2 -> try again */
int sg_read(int sg_fd, unsigned char * buff, int blocks, long long from_block,
            int bs, int cdbsz, int fua, int do_mmap)
{
    unsigned char rdCmd[MAX_SCSI_CDBSZ];
    unsigned char senseBuff[SENSE_BUFF_LEN];
    struct sg_io_hdr io_hdr;
    int res;

    if (sg_build_scsi_cdb(rdCmd, cdbsz, blocks, from_block, 0, fua, 0)) {
        fprintf(stderr, ME "bad rd cdb build, from_block=%lld, blocks=%d\n",
                from_block, blocks);
        return -1;
    }
    memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = cdbsz;
    io_hdr.cmdp = rdCmd;
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = bs * blocks;
    if (! do_mmap)
        io_hdr.dxferp = buff;
    io_hdr.mx_sb_len = SENSE_BUFF_LEN;
    io_hdr.sbp = senseBuff;
    io_hdr.timeout = DEF_TIMEOUT;
    io_hdr.pack_id = (int)from_block;
    if (do_mmap)
        io_hdr.flags |= SG_FLAG_MMAP_IO;

    while (((res = write(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) &&
           (EINTR == errno))
        ;
    if (res < 0) {
        if (ENOMEM == errno)
            return 1;
        perror("reading (wr) on sg device, error");
        return -1;
    }

    while (((res = read(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) &&
           (EINTR == errno))
        ;
    if (res < 0) {
        perror("reading (rd) on sg device, error");
        return -1;
    }
    switch (sg_err_category3(&io_hdr)) {
    case SG_LIB_CAT_CLEAN:
        break;
    case SG_LIB_CAT_RECOVERED:
        fprintf(stderr, "Recovered error while reading block=%lld, num=%d\n",
               from_block, blocks);
        break;
    case SG_LIB_CAT_MEDIA_CHANGED:
        return 2;
    default:
        sg_chk_n_print3("reading", &io_hdr);
        return -1;
    }
    sum_of_resids += io_hdr.resid;
#ifdef SG_DEBUG
    fprintf(stderr, "duration=%u ms\n", io_hdr.duration);
#endif
    return 0;
}

/* -1 -> unrecoverable error, 0 -> successful, 1 -> recoverable (ENOMEM),
   2 -> try again */
int sg_write(int sg_fd, unsigned char * buff, int blocks, long long to_block,
             int bs, int cdbsz, int fua, int do_mmap, int * diop)
{
    unsigned char wrCmd[MAX_SCSI_CDBSZ];
    unsigned char senseBuff[SENSE_BUFF_LEN];
    struct sg_io_hdr io_hdr;
    int res;

    if (sg_build_scsi_cdb(wrCmd, cdbsz, blocks, to_block, 1, fua, 0)) {
        fprintf(stderr, ME "bad wr cdb build, to_block=%lld, blocks=%d\n",
                to_block, blocks);
        return -1;
    }

    memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = cdbsz;
    io_hdr.cmdp = wrCmd;
    io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
    io_hdr.dxfer_len = bs * blocks;
    if (! do_mmap)
        io_hdr.dxferp = buff;
    io_hdr.mx_sb_len = SENSE_BUFF_LEN;
    io_hdr.sbp = senseBuff;
    io_hdr.timeout = DEF_TIMEOUT;
    io_hdr.pack_id = (int)to_block;
    if (do_mmap)
        io_hdr.flags |= SG_FLAG_MMAP_IO;
    if (diop && *diop)
        io_hdr.flags |= SG_FLAG_DIRECT_IO;

    while (((res = write(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) &&
           (EINTR == errno))
        ;
    if (res < 0) {
        if (ENOMEM == errno)
            return 1;
        perror("writing (wr) on sg device, error");
        return -1;
    }

    while (((res = read(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) &&
           (EINTR == errno))
        ;
    if (res < 0) {
        perror("writing (rd) on sg device, error");
        return -1;
    }
    switch (sg_err_category3(&io_hdr)) {
    case SG_LIB_CAT_CLEAN:
        break;
    case SG_LIB_CAT_RECOVERED:
        fprintf(stderr, "Recovered error while writing block=%lld, num=%d\n",
               to_block, blocks);
        break;
    case SG_LIB_CAT_MEDIA_CHANGED:
        return 2;
    default:
        sg_chk_n_print3("writing", &io_hdr);
        return -1;
    }
    if (diop && *diop &&
        ((io_hdr.info & SG_INFO_DIRECT_IO_MASK) != SG_INFO_DIRECT_IO))
        *diop = 0;      /* flag that dio not done (completely) */
    return 0;
}

#define STR_SZ 1024
#define INOUTF_SZ 512
#define EBUFF_SZ 512


int main(int argc, char * argv[])
{
    long long skip = 0;
    long long seek = 0;
    int bs = 0;
    int ibs = 0;
    int obs = 0;
    int bpt = DEF_BLOCKS_PER_TRANSFER;
    char str[STR_SZ];
    char * key;
    char * buf;
    char inf[INOUTF_SZ];
    int in_type = FT_OTHER;
    char outf[INOUTF_SZ];
    int out_type = FT_OTHER;
    int res, k, t;
    int infd, outfd, blocks;
    unsigned char * wrkPos;
    unsigned char * wrkBuff = NULL;
    unsigned char * wrkMmap = NULL;
    long long in_num_sect = -1;
    int in_res_sz = 0;
    long long out_num_sect = -1;
    int out_res_sz = 0;
    int do_time = 0;
    int scsi_cdbsz_in = DEF_SCSI_CDBSZ;
    int scsi_cdbsz_out = DEF_SCSI_CDBSZ;
    int do_sync = 0;
    int do_dio = 0;
    int num_dio_not_done = 0;
    int fua_mode = 0;
    int in_sect_sz, out_sect_sz;
    char ebuff[EBUFF_SZ];
    int blocks_per;
    long long req_count;
    size_t psz = getpagesize();
    struct timeval start_tm, end_tm;

    inf[0] = '\0';
    outf[0] = '\0';
    if (argc < 2) {
        usage();
        return 1;
    }

    for(k = 1; k < argc; k++) {
        if (argv[k]) {
            strncpy(str, argv[k], STR_SZ);
            str[STR_SZ - 1] = '\0';     /* strncpy may not terminate */
        } else
            continue;
        for(key = str, buf = key; *buf && *buf != '=';)
            buf++;
        if (*buf)
            *buf++ = '\0';
        if (strcmp(key,"if") == 0) {
            if ('\0' != inf[0]) {
                fprintf(stderr, "Second 'if=' argument??\n");
                return 1;
            } else {
                strncpy(inf, buf, INOUTF_SZ);
                inf[INOUTF_SZ - 1] = '\0';  /* strncpy may not terminate */
            }
        } else if (strcmp(key,"of") == 0) {
            if ('\0' != outf[0]) {
                fprintf(stderr, "Second 'of=' argument??\n");
                return 1;
            } else {
                strncpy(outf, buf, INOUTF_SZ);
                outf[INOUTF_SZ - 1] = '\0'; /* strncpy may not terminate */
            }
        } else if (0 == strcmp(key,"ibs"))
            ibs = sg_get_num(buf);
        else if (0 == strcmp(key,"obs"))
            obs = sg_get_num(buf);
        else if (0 == strcmp(key,"bs"))
            bs = sg_get_num(buf);
        else if (0 == strcmp(key,"bpt"))
            bpt = sg_get_num(buf);
        else if (0 == strcmp(key,"skip"))
            skip = sg_get_llnum(buf);
        else if (0 == strcmp(key,"seek"))
            seek = sg_get_llnum(buf);
        else if (0 == strcmp(key,"count"))
            dd_count = sg_get_llnum(buf);
        else if (0 == strcmp(key,"time"))
            do_time = sg_get_num(buf);
        else if (0 == strcmp(key,"cdbsz")) {
            scsi_cdbsz_in = sg_get_num(buf);
            scsi_cdbsz_out = scsi_cdbsz_in;
        } else if (0 == strcmp(key,"fua"))
            fua_mode = sg_get_num(buf);
        else if (0 == strcmp(key,"sync"))
            do_sync = sg_get_num(buf);
        else if (0 == strcmp(key,"dio"))
            do_dio = sg_get_num(buf);
        else if (0 == strncmp(key, "--vers", 6)) {
            fprintf(stderr, ME "for Linux sg version 3 driver: %s\n",
                    version_str);
            return 0;
        }
        else {
            fprintf(stderr, "Unrecognized argument '%s'\n", key);
            usage();
            return 1;
        }
    }
    if (bs <= 0) {
        bs = DEF_BLOCK_SIZE;
        fprintf(stderr, "Assume default 'bs' (block size) of %d bytes\n", bs);
    }
    if ((ibs && (ibs != bs)) || (obs && (obs != bs))) {
        fprintf(stderr, "If 'ibs' or 'obs' given must be same as 'bs'\n");
        usage();
        return 1;
    }
    if ((skip < 0) || (seek < 0)) {
        fprintf(stderr, "skip and seek cannot be negative\n");
        return 1;
    }
    if (bpt < 1) {
        fprintf(stderr, "bpt must be greater than 0\n");
        return 1;
    }
#ifdef SG_DEBUG
    fprintf(stderr, ME "if=%s skip=%lld of=%s seek=%lld count=%lld\n",
           inf, skip, outf, seek, dd_count);
#endif
    install_handler (SIGINT, interrupt_handler);
    install_handler (SIGQUIT, interrupt_handler);
    install_handler (SIGPIPE, interrupt_handler);
    install_handler (SIGUSR1, siginfo_handler);

    infd = STDIN_FILENO;
    outfd = STDOUT_FILENO;
    if (inf[0] && ('-' != inf[0])) {
        in_type = dd_filetype(inf);

        if (FT_ST == in_type) {
            fprintf(stderr, ME "unable to use scsi tape device %s\n", inf);
            return 1;
        }
        else if (FT_SG == in_type) {
            if ((infd = open(inf, O_RDWR)) < 0) {
                snprintf(ebuff, EBUFF_SZ, 
                         ME "could not open %s for sg reading", inf);
                perror(ebuff);
                return 1;
            }
            res = ioctl(infd, SG_GET_VERSION_NUM, &t);
            if ((res < 0) || (t < 30122)) {
                fprintf(stderr, ME "sg driver prior to 3.1.22\n");
                return 1;
            }
            in_res_sz = bs * bpt;
            if (0 != (in_res_sz % psz)) /* round up to next page */
                in_res_sz = ((in_res_sz / psz) + 1) * psz;
            if (ioctl(infd, SG_GET_RESERVED_SIZE, &t) < 0) {
                perror(ME "SG_GET_RESERVED_SIZE error");
                return 1;
            }
            if (in_res_sz > t) {
                if (ioctl(infd, SG_SET_RESERVED_SIZE, &in_res_sz) < 0) {
                    perror(ME "SG_SET_RESERVED_SIZE error");
                    return 1;
                }
            }
            wrkMmap = mmap(NULL, in_res_sz, PROT_READ | PROT_WRITE, 
                           MAP_SHARED, infd, 0);
            if (MAP_FAILED == wrkMmap) {
                snprintf(ebuff, EBUFF_SZ,
                         ME "error using mmap() on file: %s", inf);
                perror(ebuff);
                return 1;
            }
        }
        else {
            if ((infd = open(inf, O_RDONLY)) < 0) {
                snprintf(ebuff, EBUFF_SZ,
                         ME "could not open %s for reading", inf);
                perror(ebuff);
                return 1;
            }
            else if (skip > 0) {
                llse_loff_t offset = skip;

                offset *= bs;       /* could exceed 32 bits here! */
                if (llse_llseek(infd, offset, SEEK_SET) < 0) {
                    snprintf(ebuff, EBUFF_SZ, ME "couldn't skip to "
                             "required position on %s", inf);
                    perror(ebuff);
                    return 1;
                }
            }
        }
    }

    if (outf[0] && ('-' != outf[0])) {
        out_type = dd_filetype(outf);

        if (FT_ST == out_type) {
            fprintf(stderr, ME "unable to use scsi tape device %s\n", outf);
            return 1;
        }
        else if (FT_SG == out_type) {
            if ((outfd = open(outf, O_RDWR)) < 0) {
                snprintf(ebuff, EBUFF_SZ, ME "could not open %s for "
                         "sg writing", outf);
                perror(ebuff);
                return 1;
            }
            res = ioctl(outfd, SG_GET_VERSION_NUM, &t);
            if ((res < 0) || (t < 30122)) {
                fprintf(stderr, ME "sg driver prior to 3.1.22\n");
                return 1;
            }
            if (ioctl(outfd, SG_GET_RESERVED_SIZE, &t) < 0) {
                perror(ME "SG_GET_RESERVED_SIZE error");
                return 1;
            }
            out_res_sz = bs * bpt;
            if (out_res_sz > t) {
                if (ioctl(outfd, SG_SET_RESERVED_SIZE, &out_res_sz) < 0) {
                    perror(ME "SG_SET_RESERVED_SIZE error");
                    return 1;
                }
            }
            if (NULL == wrkMmap) {
                wrkMmap = mmap(NULL, out_res_sz, PROT_READ | PROT_WRITE, 
                               MAP_SHARED, outfd, 0);
                if (MAP_FAILED == wrkMmap) {
                    snprintf(ebuff, EBUFF_SZ,
                             ME "error using mmap() on file: %s", outf);
                    perror(ebuff);
                    return 1;
                }
            }
        }
        else if (FT_DEV_NULL == out_type)
            outfd = -1; /* don't bother opening */
        else {
            if (FT_RAW != out_type) {
                if ((outfd = open(outf, O_WRONLY | O_CREAT, 0666)) < 0) {
                    snprintf(ebuff, EBUFF_SZ,
                             ME "could not open %s for writing", outf);
                    perror(ebuff);
                    return 1;
                }
            }
            else {
                if ((outfd = open(outf, O_WRONLY)) < 0) {
                    snprintf(ebuff, EBUFF_SZ, ME "could not open %s "
                             "for raw writing", outf);
                    perror(ebuff);
                    return 1;
                }
            }
            if (seek > 0) {
                llse_loff_t offset = seek;

                offset *= bs;       /* could exceed 32 bits here! */
                if (llse_llseek(outfd, offset, SEEK_SET) < 0) {
                    snprintf(ebuff, EBUFF_SZ, ME "couldn't seek to "
                             "required position on %s", outf);
                    perror(ebuff);
                    return 1;
                }
            }
        }
    }
    if ((STDIN_FILENO == infd) && (STDOUT_FILENO == outfd)) {
        fprintf(stderr, 
                "Can't have both 'if' as stdin _and_ 'of' as stdout\n");
        return 1;
    }
    if (dd_count < 0) {
        in_num_sect = -1;
        if (FT_SG == in_type) {
            res = scsi_read_capacity(infd, &in_num_sect, &in_sect_sz);
            if (2 == res) {
                fprintf(stderr, 
                        "Unit attention, media changed(in), continuing\n");
                res = scsi_read_capacity(infd, &in_num_sect, &in_sect_sz);
            }
            if (0 != res) {
                fprintf(stderr, "Unable to read capacity on %s\n", inf);
                in_num_sect = -1;
            }
        } else if (FT_BLOCK == in_type) {
            if (0 != read_blkdev_capacity(infd, &in_num_sect, &in_sect_sz)) {
                fprintf(stderr, "Unable to read block capacity on %s\n", inf);
                in_num_sect = -1;
            } else if (bs != in_sect_sz) {  /* in_sect_sz only valid on success */
                fprintf(stderr, "block size on %s confusion; bs=%d, from "
                        "device=%d\n", inf, bs, in_sect_sz);
                in_num_sect = -1;
            }
        }
        if (in_num_sect > skip)
            in_num_sect -= skip;

        out_num_sect = -1;
        if (FT_SG == out_type) {
            res = scsi_read_capacity(outfd, &out_num_sect, &out_sect_sz);
            if (2 == res) {
                fprintf(stderr, 
                        "Unit attention, media changed(out), continuing\n");
                res = scsi_read_capacity(outfd, &out_num_sect, &out_sect_sz);
            }
            if (0 != res) {
                fprintf(stderr, "Unable to read capacity on %s\n", outf);
                out_num_sect = -1;
            }
        } else if (FT_BLOCK == out_type) {
            if (0 != read_blkdev_capacity(outfd, &out_num_sect,
                                          &out_sect_sz)) {
                fprintf(stderr, "Unable to read block capacity on %s\n",
                        outf);
                out_num_sect = -1;
            } else if (bs != out_sect_sz) { /* out_sect_sz only valid on success */
                fprintf(stderr, "block size on %s confusion: bs=%d, from "
                        "device=%d\n", outf, bs, out_sect_sz);
                out_num_sect = -1;
            }
        }
        if (out_num_sect > seek)
            out_num_sect -= seek;
#ifdef SG_DEBUG
        fprintf(stderr, 
            "Start of loop, count=%lld, in_num_sect=%lld, out_num_sect=%lld\n",
            dd_count, in_num_sect, out_num_sect);
#endif
        if (in_num_sect > 0) {
            if (out_num_sect > 0)
                dd_count = (in_num_sect > out_num_sect) ? out_num_sect :
                                                       in_num_sect;
            else
                dd_count = in_num_sect;
        }
        else
            dd_count = out_num_sect;
    }

    if (dd_count < 0) {
        fprintf(stderr, "Couldn't calculate count, please give one\n");
        return 1;
    }
    if ((FT_SG == in_type) && ((dd_count + skip) > UINT_MAX) &&
        (MAX_SCSI_CDBSZ != scsi_cdbsz_in)) {
        fprintf(stderr, "Note: SCSI command size increased to 16 bytes "
                "(for 'if')\n");
        scsi_cdbsz_in = MAX_SCSI_CDBSZ;
    }
    if ((FT_SG == out_type) && ((dd_count + seek) > UINT_MAX) &&
        (MAX_SCSI_CDBSZ != scsi_cdbsz_out)) {
        fprintf(stderr, "Note: SCSI command size increased to 16 bytes "
                "(for 'of')\n");
        scsi_cdbsz_out = MAX_SCSI_CDBSZ;
    }

    if (do_dio && (FT_SG != in_type)) {
        do_dio = 0;
        fprintf(stderr, ">>> dio only performed on 'of' side when 'if' is"
                " an sg device\n");
    }
    if (do_dio) {
        int fd;
        char c;

        if ((fd = open(proc_allow_dio, O_RDONLY)) >= 0) {
            if (1 == read(fd, &c, 1)) {
                if ('0' == c)
                    fprintf(stderr, ">>> %s set to '0' but should be set "
                            "to '1' for direct IO\n", proc_allow_dio);
            }
            close(fd);
        }
    }

    if (wrkMmap)
        wrkPos = wrkMmap;
    else {
        if ((FT_RAW == in_type) || (FT_RAW == out_type)) {
            wrkBuff = malloc(bs * bpt + psz);
            if (0 == wrkBuff) {
                fprintf(stderr, "Not enough user memory for raw\n");
                return 1;
            }
            wrkPos = (unsigned char *)(((unsigned long)wrkBuff + psz - 1) &
                                       (~(psz - 1)));
        }
        else {
            wrkBuff = malloc(bs * bpt);
            if (0 == wrkBuff) {
                fprintf(stderr, "Not enough user memory\n");
                return 1;
            }
            wrkPos = wrkBuff;
        }
    }

    blocks_per = bpt;
#ifdef SG_DEBUG
    fprintf(stderr, "Start of loop, count=%lld, blocks_per=%d\n", 
            dd_count, blocks_per);
#endif
    if (do_time) {
        start_tm.tv_sec = 0;
        start_tm.tv_usec = 0;
        gettimeofday(&start_tm, NULL);
    }
    req_count = dd_count;

    while (dd_count > 0) {
        blocks = (dd_count > blocks_per) ? blocks_per : dd_count;
        if (FT_SG == in_type) {
            int fua = fua_mode & 2;

            res = sg_read(infd, wrkPos, blocks, skip, bs, scsi_cdbsz_in, 
                          fua, 1);
            if (2 == res) {
                fprintf(stderr, 
                        "Unit attention, media changed, continuing (r)\n");
                res = sg_read(infd, wrkPos, blocks, skip, bs, scsi_cdbsz_in, 
                              fua, 1);
            }
            if (0 != res) {
                fprintf(stderr, "sg_read failed, skip=%lld\n", skip);
                break;
            }
            else
                in_full += blocks;
        }
        else {
            while (((res = read(infd, wrkPos, blocks * bs)) < 0) &&
                   (EINTR == errno))
                ;
            if (res < 0) {
                snprintf(ebuff, EBUFF_SZ, ME "reading, skip=%lld ", skip);
                perror(ebuff);
                break;
            }
            else if (res < blocks * bs) {
                dd_count = 0;
                blocks = res / bs;
                if ((res % bs) > 0) {
                    blocks++;
                    in_partial++;
                }
            }
            in_full += blocks;
        }

        if (0 == blocks)
            break;      /* read nothing so leave loop */

        if (FT_SG == out_type) {
            int do_mmap = (FT_SG == in_type) ? 0 : 1;
            int fua = fua_mode & 1;
            int dio_res = do_dio;

            res = sg_write(outfd, wrkPos, blocks, seek, bs, scsi_cdbsz_out, 
                           fua, do_mmap, &dio_res);
            if (2 == res) {
                fprintf(stderr, 
                        "Unit attention, media changed, continuing (w)\n");
                res = sg_write(outfd, wrkPos, blocks, seek, bs, scsi_cdbsz_out,
                               fua, do_mmap, &dio_res);
            }
            if (0 != res) {     /* also checks the result of a retry */
                fprintf(stderr, "sg_write failed, seek=%lld\n", seek);
                break;
            }
            else {
                out_full += blocks;
                if (do_dio && (0 == dio_res))
                    num_dio_not_done++;
            }
        }
        else if (FT_DEV_NULL == out_type)
            out_full += blocks; /* act as if written out without error */
        else {
            while (((res = write(outfd, wrkPos, blocks * bs)) < 0)
                   && (EINTR == errno))
                ;
            if (res < 0) {
                snprintf(ebuff, EBUFF_SZ, ME "writing, seek=%lld ", seek);
                perror(ebuff);
                break;
            }
            else if (res < blocks * bs) {
                fprintf(stderr, "output file probably full, seek=%lld ", seek);
                blocks = res / bs;
                out_full += blocks;
                if ((res % bs) > 0)
                    out_partial++;
                break;
            }
            else
                out_full += blocks;
        }
        if (dd_count > 0)
            dd_count -= blocks;
        skip += blocks;
        seek += blocks;
    }
    if ((do_time) && (start_tm.tv_sec || start_tm.tv_usec)) {
        struct timeval res_tm;
        double a, b;

        gettimeofday(&end_tm, NULL);
        res_tm.tv_sec = end_tm.tv_sec - start_tm.tv_sec;
        res_tm.tv_usec = end_tm.tv_usec - start_tm.tv_usec;
        if (res_tm.tv_usec < 0) {
            --res_tm.tv_sec;
            res_tm.tv_usec += 1000000;
        }
        a = res_tm.tv_sec;
        a += (0.000001 * res_tm.tv_usec);
        b = (double)bs * (req_count - dd_count);
        fprintf(stderr, "time to transfer data was %d.%06d secs",
               (int)res_tm.tv_sec, (int)res_tm.tv_usec);
        if ((a > 0.00001) && (b > 511))
            fprintf(stderr, ", %.2f MB/sec\n", b / (a * 1000000.0));
        else
            fprintf(stderr, "\n");
    }
    if (do_sync) {
        if (FT_SG == out_type) {
            fprintf(stderr, ">> Synchronizing cache on %s\n", outf);
            res = sg_ll_sync_cache_10(outfd, 0, 0, 0, 0, 0, 0, 0);
            if (2 == res) {
                fprintf(stderr,
                        "Unit attention, media changed(out), continuing\n");
                res = sg_ll_sync_cache_10(outfd, 0, 0, 0, 0, 0, 0, 0);
            }
            if (0 != res)
                fprintf(stderr, "Unable to synchronize cache\n");
        }
    }

    if (wrkBuff) free(wrkBuff);
    if (STDIN_FILENO != infd)
        close(infd);
    if ((STDOUT_FILENO != outfd) && (FT_DEV_NULL != out_type))
        close(outfd);
    res = 0;
    if (0 != dd_count) {
        fprintf(stderr, "Some error occurred,");
        res = 2;
    }
    print_stats();
    if (sum_of_resids)
        fprintf(stderr, ">> Non-zero sum of residual counts=%d\n", 
                sum_of_resids);
    if (num_dio_not_done)
        fprintf(stderr, ">> dio requested but _not_ done %d times\n", 
                num_dio_not_done);
    return res;
}

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Process Scheduling Issue using sg/libata
  2007-11-18 22:29                   ` Fajun Chen
@ 2007-11-18 23:07                     ` Mark Lord
  2007-11-19 16:40                       ` James Chapman
  0 siblings, 1 reply; 19+ messages in thread
From: Mark Lord @ 2007-11-18 23:07 UTC (permalink / raw)
  To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Fajun Chen wrote:
>
> As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
> application, I have not been able to compress it into a small and
> publishable form. However, this issue can be easily reproduced on my
> ARM XScale target using sg3_utils code as follows:
> 1. Run printtime.c attached, which prints a message to the console in a loop.
> 2. Run sgm_dd (part of the sg3_utils package, source code attached) on the
> same system as follows:
>> sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
> The print task can be delayed for as many as 25 seconds. Surprisingly,
> I can't reproduce the problem in an i386 test system with a more
> powerful processor.
> 
> Some clarification on the MAP_ANONYMOUS option in mmap(). After fixing a
> bug and more testing, this option seems to make no difference to cpu
> load. Sorry about the previous report. Back to the drawing board now :-)
..
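
The reproduction quoted above depends on a second task whose output visibly
stalls. printtime.c itself is not reproduced in this message, so what follows
is only a minimal sketch of that kind of probe, not the attached file: it
sleeps for a fixed period and reports how late the wakeup was. The names
elapsed_ms() and measure_delay_ms() are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() and nanosleep() */
#include <stdio.h>
#include <time.h>

/* Elapsed whole milliseconds from *a to *b. */
long elapsed_ms(const struct timespec *a, const struct timespec *b)
{
    return (long)(b->tv_sec - a->tv_sec) * 1000L +
           (b->tv_nsec - a->tv_nsec) / 1000000L;
}

/* Sleep for period_ms and return how many ms later than requested
   the task actually got to run again. */
long measure_delay_ms(long period_ms)
{
    struct timespec before, after, req;

    req.tv_sec = period_ms / 1000;
    req.tv_nsec = (period_ms % 1000) * 1000000L;
    clock_gettime(CLOCK_MONOTONIC, &before);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &after);
    return elapsed_ms(&before, &after) - period_ms;
}

/* Hypothetical helper: print the lateness of n consecutive wakeups. */
void report_delays(int n, long period_ms)
{
    int i;

    for (i = 0; i < n; i++)
        printf("wakeup %d was %ld ms late\n", i,
               measure_delay_ms(period_ms));
}
```

Calling report_delays(60, 1000) while sgm_dd runs with bpt=1 should make a
stall of the reported size show up as a lateness of several thousand ms.
Older C libraries may need -lrt at link time for clock_gettime().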

Okay, I don't see anything unusual here.  The code is on a slow CPU,
and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA device.

This *will* tie up the CPU at 100% for the duration of the I/O,
because the I/O happens in interrupt handlers, which are outside
of the realm of the CPU scheduler.
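
One way to check whether interrupt handling really is where the time goes is
to sample the aggregate "cpu" line of /proc/stat before and after a run and
compare the irq and softirq columns with the total. The sketch below assumes
the usual field order (user, nice, system, idle, iowait, irq, softirq); the
helper names are invented for this example, and tick-based accounting on a
small ARM board may not attribute interrupt time accurately, so treat the
result as a hint rather than a measurement.

```c
#include <stdio.h>

/* Tick counters from the aggregate "cpu" line of /proc/stat. */
struct cpu_times {
    unsigned long long user, nice, system, idle, iowait, irq, softirq;
};

/* Parse one "cpu ..." line; return 0 on success, -1 if malformed. */
int parse_cpu_line(const char *line, struct cpu_times *t)
{
    int n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->system, &t->idle,
                   &t->iowait, &t->irq, &t->softirq);

    return (7 == n) ? 0 : -1;
}

/* Read the first line of /proc/stat (the all-CPU summary) into *t. */
int read_cpu_times(struct cpu_times *t)
{
    char line[256];
    int res = -1;
    FILE *fp = fopen("/proc/stat", "r");

    if (NULL == fp)
        return -1;
    if (fgets(line, sizeof(line), fp))
        res = parse_cpu_line(line, t);
    fclose(fp);
    return res;
}

/* Percentage of the interval between samples a and b that was
   accounted to hard plus soft interrupt handling. */
double irq_percent(const struct cpu_times *a, const struct cpu_times *b)
{
    unsigned long long irq = (b->irq - a->irq) + (b->softirq - a->softirq);
    unsigned long long total = irq +
        (b->user - a->user) + (b->nice - a->nice) +
        (b->system - a->system) + (b->idle - a->idle) +
        (b->iowait - a->iowait);

    if (0 == total)
        return 0.0;
    return 100.0 * (double)irq / (double)total;
}
```

Taking one sample with read_cpu_times() before starting sgm_dd and another a
few seconds into the transfer, then passing both to irq_percent(), gives a
rough figure to compare between bpt=1 and bpt=256.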

This is a known shortcoming of Linux for real-time uses.

When the I/O uses DMA transfers, it *may* still have a similar effect,
depending upon the caching in the ATA device, and on how the DMA shares
the memory bus with the CPU.

Again, no surprise here.

One way to deal with it in an embedded device, is to force the
application that's generating the I/O to self-throttle.
Or modify the device driver to self-throttle.
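
As an illustration of application-side throttling, the sketch below pauses
the I/O generator after every batch of requests so that other tasks get a
window in which to run. Everything here is an assumption made for the
example: struct throttle and throttle_tick() are not part of sg3_utils, and
the right batch size and pause length would have to be found by experiment
on the target.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <time.h>

/* Hypothetical throttle state kept by the application issuing I/O. */
struct throttle {
    unsigned int batch;     /* requests allowed between pauses (>= 1) */
    long pause_ms;          /* length of each pause in milliseconds */
    unsigned int issued;    /* requests issued since the last pause */
    unsigned long pauses;   /* number of pauses taken so far */
};

/* Call once after each I/O request has completed.  Sleeps for
   pause_ms after every 'batch' requests.  Returns 1 if it paused,
   0 otherwise. */
int throttle_tick(struct throttle *th)
{
    struct timespec req;

    if (++th->issued < th->batch)
        return 0;
    th->issued = 0;
    th->pauses++;
    req.tv_sec = th->pause_ms / 1000;
    req.tv_nsec = (th->pause_ms % 1000) * 1000000L;
    nanosleep(&req, NULL);  /* no request is in flight while asleep */
    return 1;
}
```

In sgm_dd this would amount to one throttle_tick() call per pass through the
while (dd_count > 0) loop, for example with batch set to 64 and pause_ms set
to 1. The pause costs throughput, so it is a trade between transfer rate and
the latency seen by other tasks.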

You may want to find an embedded Linux consultant to help out
with this situation if it's beyond your expertise.

Cheers

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Process Scheduling Issue using sg/libata
  2007-11-18 23:07                     ` Mark Lord
@ 2007-11-19 16:40                       ` James Chapman
  2007-11-19 16:51                         ` Tejun Heo
  0 siblings, 1 reply; 19+ messages in thread
From: James Chapman @ 2007-11-19 16:40 UTC (permalink / raw)
  To: Mark Lord, Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo

Mark Lord wrote:
> Fajun Chen wrote:
>>
>> As a matter of fact, I'm using /dev/sg*.  Due to the size of my test
>> application, I have not been able to compress it into a small and
>> publishable form. However, this issue can be easily reproduced on my
>> ARM XScale target using sg3_utils code as follows:
>> 1. Run printtime.c attached, which prints a message to the console in a loop.
>> 2. Run sgm_dd (part of the sg3_utils package, source code attached) on the
>> same system as follows:
>>> sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
>> The print task can be delayed for as many as 25 seconds. Surprisingly,
>> I can't reproduce the problem in an i386 test system with a more
>> powerful processor.
>>
>> Some clarification on the MAP_ANONYMOUS option in mmap(). After fixing a
>> bug and more testing, this option seems to make no difference to cpu
>> load. Sorry about the previous report. Back to the drawing board now :-)
> ..
> 
> Okay, I don't see anything unusual here.  The code is on a slow CPU,
> and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA
> device.
> 
> This *will* tie up the CPU at 100% for the duration of the I/O,
> because the I/O happens in interrupt handlers, which are outside
> of the realm of the CPU scheduler.
> 
> This is a known shortcoming of Linux for real-time uses.
> 
> When the I/O uses DMA transfers, it *may* still have a similar effect,
> depending upon the caching in the ATA device, and on how the DMA shares
> the memory bus with the CPU.
> 
> Again, no surprise here.
> 
> One way to deal with it in an embedded device, is to force the
> application that's generating the I/O to self-throttle.
> Or modify the device driver to self-throttle.

Does disk access have to be so interrupt driven? Could disk interrupt
handling be done in a softirq/kthread like the networking guys deal with
network device interrupts? This would prevent the system from
live-locking when it is being bombarded with disk IO events. It doesn't
seem right that the disk IO subsystem can cause interrupt live-lock on
relatively slow CPUs...

> You may want to find an embedded Linux consultant to help out
> with this situation if it's beyond your expertise.

Check out the rtlinux patch, which pushes all interrupt handling out to
per-cpu kernel threads (irqd). The kernel scheduler then regains control
of what runs when.

Another option is to change your ATA driver to do interrupt processing
at task level using a workqueue or similar.

-- 
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Process Scheduling Issue using sg/libata
  2007-11-19 16:40                       ` James Chapman
@ 2007-11-19 16:51                         ` Tejun Heo
  2007-11-19 17:17                           ` Alan Cox
  0 siblings, 1 reply; 19+ messages in thread
From: Tejun Heo @ 2007-11-19 16:51 UTC (permalink / raw)
  To: James Chapman
  Cc: Mark Lord, Fajun Chen, linux-ide@vger.kernel.org, linux-scsi

James Chapman wrote:
> Mark Lord wrote:
>> One way to deal with it in an embedded device, is to force the
>> application that's generating the I/O to self-throttle.
>> Or modify the device driver to self-throttle.
> 
> Does disk access have to be so interrupt driven? Could disk interrupt
> handling be done in a softirq/kthread like the networking guys deal with
> network device interrupts? This would prevent the system from
> live-locking when it is being bombarded with disk IO events. It doesn't
> seem right that the disk IO subsystem can cause interrupt live-lock on
> relatively slow CPUs...
> 
>> You may want to find an embedded Linux consultant to help out
>> with this situation if it's beyond your expertise.
> 
> Check out the rtlinux patch, which pushes all interrupt handling out to
> per-cpu kernel threads (irqd). The kernel scheduler then regains control
> of what runs when.
> 
> Another option is to change your ATA driver to do interrupt processing
> at task level using a workqueue or similar.

SFF ATA controllers are peculiar in that...

1. they don't have a reliable IRQ pending bit.

2. they don't have a reliable IRQ mask bit.

3. some controllers tank the machine completely if the status or data
register is accessed differently than the chip likes.

So, it's not like we're all dickheads.  We know it's good to take those
out of irq handler.  The hardware just isn't very forgiving and I bet
you'll get obscure machine lockups if the RT kernel arbitrarily pushes
ATA PIO data transfers into kernel threads.

I think doing what IDE has been doing (disabling IRQ from interrupt
controller) is the way to go.

-- 
tejun


* Re: Process Scheduling Issue using sg/libata
  2007-11-19 16:51                         ` Tejun Heo
@ 2007-11-19 17:17                           ` Alan Cox
  0 siblings, 0 replies; 19+ messages in thread
From: Alan Cox @ 2007-11-19 17:17 UTC (permalink / raw)
  To: Tejun Heo
  Cc: James Chapman, Mark Lord, Fajun Chen, linux-ide@vger.kernel.org,
	linux-scsi

> SFF ATA controllers are peculiar in that...
> 
> 1. they don't have a reliable IRQ pending bit.
> 
> 2. they don't have a reliable IRQ mask bit.
> 
> 3. some controllers tank the machine completely if the status or data
> register is accessed differently than the chip likes.

And 4, which is a killer for a lot of RT users:

An I/O cycle to a taskfile-style controller generally goes at ISA-type
speed down the wire to the drive and back again. The CPU is stalled for
this and there is nothing we can do about it.
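
To put a rough number on that, here is a back-of-the-envelope sketch.
The transfer width and per-cycle time are assumptions, not
measurements: 16 bits per data register access and about 1 microsecond
per I/O cycle are used only as illustrative figures for "ISA type
speed", and pio_stall_seconds is a made-up helper name.

```c
#include <assert.h>

/* Estimate how long the CPU is stalled moving `bytes` of data by PIO,
 * given the bytes moved per I/O cycle and an assumed time per cycle. */
double pio_stall_seconds(unsigned long long bytes,
                         unsigned int bytes_per_cycle,
                         double cycle_ns)
{
    unsigned long long cycles;

    assert(bytes_per_cycle > 0 && cycle_ns > 0.0);
    /* Round up: a partial word still costs a full I/O cycle. */
    cycles = (bytes + bytes_per_cycle - 1) / bytes_per_cycle;
    return (double)cycles * cycle_ns / 1e9;
}
```

With those assumed figures, pio_stall_seconds(10000000ULL, 2, 1000.0)
gives 5.0, i.e. roughly five seconds of stalled CPU for the 10MBytes
of PIO mentioned earlier in the thread. Real hardware will differ; the
point is only that the cost scales with the number of register
accesses, not with anything the scheduler controls.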

> 
> So, it's not like we're all dickheads.  We know it's good to take those
> out of irq handler.  The hardware just isn't very forgiving and I bet
> you'll get obscure machine lockups if the RT kernel arbitrarily pushes
> ATA PIO data transfers into kernel threads.
> 
> I think doing what IDE has been doing (disabling IRQ from interrupt
> controller) is the way to go.

Agreed - at which point RT or otherwise you can push it out. If you need
to do serious (sub-1 ms) ATA then also go get a non-SFF controller.

Alan


Thread overview: 19+ messages
2007-11-17  0:49 Process Scheduling Issue using sg/libata Fajun Chen
2007-11-17  3:02 ` Tejun Heo
2007-11-17  6:14   ` Fajun Chen
2007-11-17 17:13     ` James Chapman
2007-11-17 19:37       ` Fajun Chen
2007-11-17  4:30 ` Mark Lord
2007-11-17  7:20   ` Fajun Chen
2007-11-17 16:25     ` Mark Lord
2007-11-17 19:20       ` Fajun Chen
2007-11-17 19:55         ` Mark Lord
2007-11-18  6:48           ` Fajun Chen
2007-11-18 14:32             ` Mark Lord
2007-11-18 19:14               ` Fajun Chen
2007-11-18 19:54                 ` Mark Lord
2007-11-18 22:29                   ` Fajun Chen
2007-11-18 23:07                     ` Mark Lord
2007-11-19 16:40                       ` James Chapman
2007-11-19 16:51                         ` Tejun Heo
2007-11-19 17:17                           ` Alan Cox
