* Process Scheduling Issue using sg/libata
@ 2007-11-17  0:49 Fajun Chen
  2007-11-17  3:02 ` Tejun Heo
  2007-11-17  4:30 ` Mark Lord
  0 siblings, 2 replies; 19+ messages in thread
From: Fajun Chen @ 2007-11-17  0:49 UTC (permalink / raw)
  To: linux-ide@vger.kernel.org, linux-scsi

Hi All,

I use sg/libata and ATA pass-through for reads/writes. Linux 2.6.18-rc2 and libata version 2.00 are loaded on an ARM XScale board. Under heavy CPU load (e.g. when blocks per transfer/sector count is set to 1), I've observed that the test application can hog the CPU for a long time (more than 20 seconds), and other processes, including a high-priority shell, cannot get a time slice to run. What's interesting is that if the application is under heavy IO load (e.g. when blocks per transfer/sector count is set to 256), the problem goes away. I also tested with the open source sg_utils and got the same result, so this is not a problem specific to my user-space application.

Since user preemption is checked when the kernel is about to return to user space from a system call, the process scheduler should be invoked after each system call. Something seems to be broken here. I found a similar issue below:
http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2
But that turned out to be an issue with the MTD/JFFS2 drivers, which are not used in my system.

Has anyone experienced similar issues with sg/libata? Any information would be greatly appreciated.

Thanks,
Fajun

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Process Scheduling Issue using sg/libata 2007-11-17 0:49 Process Scheduling Issue using sg/libata Fajun Chen @ 2007-11-17 3:02 ` Tejun Heo 2007-11-17 6:14 ` Fajun Chen 2007-11-17 4:30 ` Mark Lord 1 sibling, 1 reply; 19+ messages in thread From: Tejun Heo @ 2007-11-17 3:02 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi Fajun Chen wrote: > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 > and libata version 2.00 are loaded on ARM XScale board. Under heavy > cpu load (e.g. when blocks per transfer/sector count is set to 1), > I've observed that the test application can suck cpu away for long > time (more than 20 seconds) and other processes including high > priority shell can not get the time slice to run. What's interesting > is that if the application is under heavy IO load (e.g. when blocks > per transfer/sector count is set to 256), the problem goes away. I > also tested with open source code sg_utils and got the same result, so > this is not a problem specific to my user-space application. > > Since user preemption is checked when the kernel is about to return to > user-space from a system call, process scheduler should be invoked > after each system call. Something seems to be broken here. I found a > similar issue below: > http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2 > But that turns out to be an issue with MTD/JFFS2 drivers, which are > not used in my system. > > Has anyone experienced similar issues with sg/libata? Any information > would be greatly appreciated. That's one weird story. Does kernel say anything during that 20 seconds? -- tejun ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Process Scheduling Issue using sg/libata 2007-11-17 3:02 ` Tejun Heo @ 2007-11-17 6:14 ` Fajun Chen 2007-11-17 17:13 ` James Chapman 0 siblings, 1 reply; 19+ messages in thread From: Fajun Chen @ 2007-11-17 6:14 UTC (permalink / raw) To: Tejun Heo; +Cc: linux-ide@vger.kernel.org, linux-scsi On 11/16/07, Tejun Heo <htejun@gmail.com> wrote: > Fajun Chen wrote: > > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 > > and libata version 2.00 are loaded on ARM XScale board. Under heavy > > cpu load (e.g. when blocks per transfer/sector count is set to 1), > > I've observed that the test application can suck cpu away for long > > time (more than 20 seconds) and other processes including high > > priority shell can not get the time slice to run. What's interesting > > is that if the application is under heavy IO load (e.g. when blocks > > per transfer/sector count is set to 256), the problem goes away. I > > also tested with open source code sg_utils and got the same result, so > > this is not a problem specific to my user-space application. > > > > Since user preemption is checked when the kernel is about to return to > > user-space from a system call, process scheduler should be invoked > > after each system call. Something seems to be broken here. I found a > > similar issue below: > > http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2 > > But that turns out to be an issue with MTD/JFFS2 drivers, which are > > not used in my system. > > > > Has anyone experienced similar issues with sg/libata? Any information > > would be greatly appreciated. > > That's one weird story. Does kernel say anything during that 20 seconds? > No. Nothing in kernel log. Fajun ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Process Scheduling Issue using sg/libata 2007-11-17 6:14 ` Fajun Chen @ 2007-11-17 17:13 ` James Chapman 2007-11-17 19:37 ` Fajun Chen 0 siblings, 1 reply; 19+ messages in thread From: James Chapman @ 2007-11-17 17:13 UTC (permalink / raw) To: Fajun Chen; +Cc: Tejun Heo, linux-ide@vger.kernel.org, linux-scsi Fajun Chen wrote: > On 11/16/07, Tejun Heo <htejun@gmail.com> wrote: >> Fajun Chen wrote: >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 >>> and libata version 2.00 are loaded on ARM XScale board. Under heavy >>> cpu load (e.g. when blocks per transfer/sector count is set to 1), >>> I've observed that the test application can suck cpu away for long >>> time (more than 20 seconds) and other processes including high >>> priority shell can not get the time slice to run. What's interesting >>> is that if the application is under heavy IO load (e.g. when blocks >>> per transfer/sector count is set to 256), the problem goes away. I >>> also tested with open source code sg_utils and got the same result, so >>> this is not a problem specific to my user-space application. >>> >>> Since user preemption is checked when the kernel is about to return to >>> user-space from a system call, process scheduler should be invoked >>> after each system call. Something seems to be broken here. I found a >>> similar issue below: >>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2 >>> But that turns out to be an issue with MTD/JFFS2 drivers, which are >>> not used in my system. >>> >>> Has anyone experienced similar issues with sg/libata? Any information >>> would be greatly appreciated. >> That's one weird story. Does kernel say anything during that 20 seconds? >> > No. Nothing in kernel log. > > Fajun Have you considered using oprofile to find out what the CPU is doing during the 20 seconds? Does the problem occur when you put it under load using another method? What are the ATA and network drivers here? 
I've seen some awful out-of-tree device drivers hog the CPU with busy-waits and other crap. Oprofile results should show the culprit. -- James Chapman Katalix Systems Ltd http://www.katalix.com Catalysts for your Embedded Linux software development ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Process Scheduling Issue using sg/libata 2007-11-17 17:13 ` James Chapman @ 2007-11-17 19:37 ` Fajun Chen 0 siblings, 0 replies; 19+ messages in thread From: Fajun Chen @ 2007-11-17 19:37 UTC (permalink / raw) To: James Chapman; +Cc: Tejun Heo, linux-ide@vger.kernel.org, linux-scsi On 11/17/07, James Chapman <jchapman@katalix.com> wrote: > Fajun Chen wrote: > > On 11/16/07, Tejun Heo <htejun@gmail.com> wrote: > >> Fajun Chen wrote: > >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 > >>> and libata version 2.00 are loaded on ARM XScale board. Under heavy > >>> cpu load (e.g. when blocks per transfer/sector count is set to 1), > >>> I've observed that the test application can suck cpu away for long > >>> time (more than 20 seconds) and other processes including high > >>> priority shell can not get the time slice to run. What's interesting > >>> is that if the application is under heavy IO load (e.g. when blocks > >>> per transfer/sector count is set to 256), the problem goes away. I > >>> also tested with open source code sg_utils and got the same result, so > >>> this is not a problem specific to my user-space application. > >>> > >>> Since user preemption is checked when the kernel is about to return to > >>> user-space from a system call, process scheduler should be invoked > >>> after each system call. Something seems to be broken here. I found a > >>> similar issue below: > >>> http://marc.info/?l=linux-arm-kernel&m=103121214521819&w=2 > >>> But that turns out to be an issue with MTD/JFFS2 drivers, which are > >>> not used in my system. > >>> > >>> Has anyone experienced similar issues with sg/libata? Any information > >>> would be greatly appreciated. > >> That's one weird story. Does kernel say anything during that 20 seconds? > >> > > No. Nothing in kernel log. > > > > Fajun > > Have you considered using oprofile to find out what the CPU is doing > during the 20 seconds? 
>
Haven't tried oprofile yet; I'm not sure it would get a time slice to run, though. During these 20 seconds, I've verified that my application is still busy with R/W ops.

> Does the problem occur when you put it under load using another method?
> What are the ATA and network drivers here? I've seen some awful
> out-of-tree device drivers hog the CPU with busy-waits and other crap.
> Oprofile results should show the culprit.

If blocks per transfer/sector count is set to 256, which means the CPU has less load (any other implications?), this problem no longer occurs. Our target system uses the libata sil24/pata680 drivers and has a customized FIFO driver but no network driver. The relevant variable here is blocks per transfer/sector count, which seems to matter only to sg/libata.

Thanks,
Fajun
* Re: Process Scheduling Issue using sg/libata 2007-11-17 0:49 Process Scheduling Issue using sg/libata Fajun Chen 2007-11-17 3:02 ` Tejun Heo @ 2007-11-17 4:30 ` Mark Lord 2007-11-17 7:20 ` Fajun Chen 1 sibling, 1 reply; 19+ messages in thread From: Mark Lord @ 2007-11-17 4:30 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo Fajun Chen wrote: > Hi All, > > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 > and libata version 2.00 are loaded on ARM XScale board. Under heavy > cpu load (e.g. when blocks per transfer/sector count is set to 1), > I've observed that the test application can suck cpu away for long > time (more than 20 seconds) and other processes including high > priority shell can not get the time slice to run. What's interesting > is that if the application is under heavy IO load (e.g. when blocks > per transfer/sector count is set to 256), the problem goes away. I > also tested with open source code sg_utils and got the same result, so > this is not a problem specific to my user-space application. .. Post the relevant code here, and then we'll be able to better understand and explain it to you. For example, if the code is using ATA opcodes 0x20, 0x21, 0x24, 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops), then this behaviour does not surprise me in the least. Fully expected and difficult to avoid. Cheers ^ permalink raw reply [flat|nested] 19+ messages in thread
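The opcode check Mark describes can be sketched as a small lookup helper. This is only an illustration: the function name `is_pio_rw_opcode` is hypothetical, and the opcode list is copied from the message above rather than taken from the ATA specification.

```c
#include <stddef.h>

/* R/W PIO opcodes exactly as listed in the message above. */
static const unsigned char pio_rw_opcodes[] = {
    0x20, 0x21, 0x24, 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4, 0xc5,
};

/* Hypothetical helper: returns 1 if ata_op is in the list above, else 0. */
int is_pio_rw_opcode(unsigned char ata_op)
{
    size_t i;

    for (i = 0; i < sizeof(pio_rw_opcodes) / sizeof(pio_rw_opcodes[0]); i++) {
        if (pio_rw_opcodes[i] == ata_op)
            return 1;
    }
    return 0;
}
```

A test application could call this on the command byte it is about to send and log a warning when a PIO opcode is used; READ DMA (0xc8), for example, is not in the list.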
* Re: Process Scheduling Issue using sg/libata 2007-11-17 4:30 ` Mark Lord @ 2007-11-17 7:20 ` Fajun Chen 2007-11-17 16:25 ` Mark Lord 0 siblings, 1 reply; 19+ messages in thread From: Fajun Chen @ 2007-11-17 7:20 UTC (permalink / raw) To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo On 11/16/07, Mark Lord <liml@rtr.ca> wrote: > Fajun Chen wrote: > > Hi All, > > > > I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 > > and libata version 2.00 are loaded on ARM XScale board. Under heavy > > cpu load (e.g. when blocks per transfer/sector count is set to 1), > > I've observed that the test application can suck cpu away for long > > time (more than 20 seconds) and other processes including high > > priority shell can not get the time slice to run. What's interesting > > is that if the application is under heavy IO load (e.g. when blocks > > per transfer/sector count is set to 256), the problem goes away. I > > also tested with open source code sg_utils and got the same result, so > > this is not a problem specific to my user-space application. > .. > > Post the relevant code here, and then we'll be able to better understand > and explain it to you. > > For example, if the code is using ATA opcodes 0x20, 0x21, 0x24, > 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops), > then this behaviour does not surprise me in the least. Fully expected > and difficult to avoid. > This problem also happens with R/W DMA ops. Below are simplified code snippets: // Open one sg device for read if ((sg_fd = open(dev_name, O_RDWR))<0) { ... } read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, MAP_SHARED, sg_fd, 0); // Open the same sg device for write if ((sg_fd_wr = open(dev_name, O_RDWR))<0) { ... 
}
write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                          MAP_SHARED, sg_fd_wr, 0);

sg_io_hdr_t io_hdr;

memset(&io_hdr, 0, sizeof(sg_io_hdr_t));

io_hdr.interface_id = 'S';
io_hdr.mx_sb_len = sizeof(sense_buffer);
io_hdr.sbp = sense_buffer;
io_hdr.dxfer_len = dxfer_len;
io_hdr.cmd_len = cmd_len;
io_hdr.cmdp = cmdp;               // ATA pass through command block
io_hdr.timeout = cmd_tmo * 1000;  // In millisecs
io_hdr.pack_id = id;              // Read/write counter for now
io_hdr.iovec_count = 0;           // scatter gather elements, 0=not being used

if (direction == 1)
{
    io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
    io_hdr.flags |= SG_FLAG_MMAP_IO;
    status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
}
else
{
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.flags |= SG_FLAG_MMAP_IO;
    status = ioctl(sg_fd, SG_IO, &io_hdr);
}
...

Mmaped IO is a moot point here since this problem is also observed when using direct IO.

Thanks,
Fajun
* Re: Process Scheduling Issue using sg/libata 2007-11-17 7:20 ` Fajun Chen @ 2007-11-17 16:25 ` Mark Lord 2007-11-17 19:20 ` Fajun Chen 0 siblings, 1 reply; 19+ messages in thread From: Mark Lord @ 2007-11-17 16:25 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo Fajun Chen wrote: > On 11/16/07, Mark Lord <liml@rtr.ca> wrote: >> Fajun Chen wrote: >>> Hi All, >>> >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 >>> and libata version 2.00 are loaded on ARM XScale board. Under heavy >>> cpu load (e.g. when blocks per transfer/sector count is set to 1), >>> I've observed that the test application can suck cpu away for long >>> time (more than 20 seconds) and other processes including high >>> priority shell can not get the time slice to run. What's interesting >>> is that if the application is under heavy IO load (e.g. when blocks >>> per transfer/sector count is set to 256), the problem goes away. I >>> also tested with open source code sg_utils and got the same result, so >>> this is not a problem specific to my user-space application. >> .. >> >> Post the relevant code here, and then we'll be able to better understand >> and explain it to you. >> >> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24, >> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops), >> then this behaviour does not surprise me in the least. Fully expected >> and difficult to avoid. >> > > This problem also happens with R/W DMA ops. Below are simplified code snippets: > // Open one sg device for read > if ((sg_fd = open(dev_name, O_RDWR))<0) > { > ... > } > read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, > MAP_SHARED, sg_fd, 0); > > // Open the same sg device for write > if ((sg_fd_wr = open(dev_name, O_RDWR))<0) > { > ... > } > write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, > MAP_SHARED, sg_fd_wr, 0); .. Mmmm.. what is the purpose of those two mmap'd areas ? 
I think this is important and relevant here: what are they used for?

As coded above, these are memory mapped areas that (1) overlap,
and (2) will be demand paged automatically to/from the disk
as they are accessed/modified. This *will* conflict with any SG_IO
operations happening at the same time on the same device.

????

> sg_io_hdr_t io_hdr;
>
> memset(&io_hdr, 0, sizeof(sg_io_hdr_t));
>
> io_hdr.interface_id = 'S';
> io_hdr.mx_sb_len = sizeof(sense_buffer);
> io_hdr.sbp = sense_buffer;
> io_hdr.dxfer_len = dxfer_len;
> io_hdr.cmd_len = cmd_len;
> io_hdr.cmdp = cmdp; // ATA pass through command block
> io_hdr.timeout = cmd_tmo * 1000; // In millisecs
> io_hdr.pack_id = id; // Read/write counter for now
> io_hdr.iovec_count=0; // scatter gather elements, 0=not being used
>
> if (direction == 1)
> {
>     io_hdr.dxfer_direction = SG_DXFER_TO_DEV;
>     io_hdr.flags |= SG_FLAG_MMAP_IO;
>     status = ioctl(sg_fd_wr, SG_IO, &io_hdr);
> }
> else
> {
>     io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
>     io_hdr.flags |= SG_FLAG_MMAP_IO;
>     status = ioctl(sg_fd, SG_IO, &io_hdr);
> }
> ...
> Mmaped IO is a moot point here since this problem is also observed
> when using direct IO.
>
> Thanks,
> Fajun
* Re: Process Scheduling Issue using sg/libata 2007-11-17 16:25 ` Mark Lord @ 2007-11-17 19:20 ` Fajun Chen 2007-11-17 19:55 ` Mark Lord 0 siblings, 1 reply; 19+ messages in thread From: Fajun Chen @ 2007-11-17 19:20 UTC (permalink / raw) To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo On 11/17/07, Mark Lord <liml@rtr.ca> wrote: > Fajun Chen wrote: > > On 11/16/07, Mark Lord <liml@rtr.ca> wrote: > >> Fajun Chen wrote: > >>> Hi All, > >>> > >>> I use sg/libata and ata pass through for read/writes. Linux 2.6.18-rc2 > >>> and libata version 2.00 are loaded on ARM XScale board. Under heavy > >>> cpu load (e.g. when blocks per transfer/sector count is set to 1), > >>> I've observed that the test application can suck cpu away for long > >>> time (more than 20 seconds) and other processes including high > >>> priority shell can not get the time slice to run. What's interesting > >>> is that if the application is under heavy IO load (e.g. when blocks > >>> per transfer/sector count is set to 256), the problem goes away. I > >>> also tested with open source code sg_utils and got the same result, so > >>> this is not a problem specific to my user-space application. > >> .. > >> > >> Post the relevant code here, and then we'll be able to better understand > >> and explain it to you. > >> > >> For example, if the code is using ATA opcodes 0x20, 0x21, 0x24, > >> 0x30, 0x31, 0x34, 0x29, 0x39, 0xc4 or 0xc5 (any of the R/W PIO ops), > >> then this behaviour does not surprise me in the least. Fully expected > >> and difficult to avoid. > >> > > > > This problem also happens with R/W DMA ops. Below are simplified code snippets: > > // Open one sg device for read > > if ((sg_fd = open(dev_name, O_RDWR))<0) > > { > > ... > > } > > read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, > > MAP_SHARED, sg_fd, 0); > > > > // Open the same sg device for write > > if ((sg_fd_wr = open(dev_name, O_RDWR))<0) > > { > > ... 
> > }
> > write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> > MAP_SHARED, sg_fd_wr, 0);
> ..
>
> Mmmm.. what is the purpose of those two mmap'd areas ?
> I think this is important and relevant here: what are they used for?
>
> As coded above, these are memory mapped areas that (1) overlap,
> and (2) will be demand paged automatically to/from the disk
> as they are accessed/modified. This *will* conflict with any SG_IO
> operations happening at the same time on the same device.
>
> ????

The purpose of using two memory mapped areas is to meet our requirement that certain data patterns for writing be kept across commands. For instance, if one buffer is used for both reads and writes, then this buffer will need to be re-populated with the write data after each read command, which would be very costly for mixed write-read ops. The separate R/W buffers also facilitate data comparison.

These buffers are not used at the same time (one will be used only after the command on the other has completed). My application is the only program accessing the disk using sg/libata, and the rest of the programs run from ramdisk. Also, each buffer is only about 0.5MB and we have 64MB RAM on the target board. With this setup, these two buffers should be pretty much independent and free from the block layer/file system, correct?

One thing is worth mentioning here. If the application is set to low priority (nice 19) or sched_yield() is called after each R/W command, then this issue disappears but performance suffers.

Some thoughts here. For a static process, the Linux scheduler could assign some dynamic priority to it based on activity, age, etc. Any chance that the scheduler favors my application unfairly due to the load condition?

Thanks,
Fajun
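The sched_yield() workaround mentioned above could look roughly like the sketch below. The names `issue_fn`, `run_commands_yielding`, and `count_call` are hypothetical; the callback stands in for whatever code issues one SG_IO command. As the message notes, this trades throughput for responsiveness, so it is a diagnostic aid rather than a fix.

```c
#include <sched.h>

/* Hypothetical callback type: issues one R/W command, returns 0 on success. */
typedef int (*issue_fn)(void *ctx);

/*
 * Issue up to `count` commands, calling sched_yield() after each one so
 * that other runnable processes get a chance at the CPU.
 * Returns the number of commands that succeeded before the first failure
 * (or `count` if all succeeded).
 */
unsigned int run_commands_yielding(issue_fn issue, void *ctx, unsigned int count)
{
    unsigned int done;

    for (done = 0; done < count; done++) {
        if (issue(ctx) != 0)
            break;
        sched_yield();
    }
    return done;
}

/* Stand-in callback for illustration: counts calls instead of doing I/O. */
int count_call(void *ctx)
{
    ++*(unsigned int *)ctx;
    return 0;
}
```

In a real test loop the callback would wrap the `ioctl(fd, SG_IO, ...)` call; `count_call` only shows the calling convention.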
* Re: Process Scheduling Issue using sg/libata 2007-11-17 19:20 ` Fajun Chen @ 2007-11-17 19:55 ` Mark Lord 2007-11-18 6:48 ` Fajun Chen 0 siblings, 1 reply; 19+ messages in thread From: Mark Lord @ 2007-11-17 19:55 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo Fajun Chen wrote: > On 11/17/07, Mark Lord <liml@rtr.ca> wrote: >> Fajun Chen wrote: >>> On 11/16/07, Mark Lord <liml@rtr.ca> wrote: >>>> Fajun Chen wrote: .. >>> This problem also happens with R/W DMA ops. Below are simplified code snippets: >>> // Open one sg device for read >>> if ((sg_fd = open(dev_name, O_RDWR))<0) >>> { >>> ... >>> } >>> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, >>> MAP_SHARED, sg_fd, 0); >>> >>> // Open the same sg device for write >>> if ((sg_fd_wr = open(dev_name, O_RDWR))<0) >>> { >>> ... >>> } >>> write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, >>> MAP_SHARED, sg_fd_wr, 0); >> .. >> >> Mmmm.. what is the purpose of those two mmap'd areas ? >> I think this is important and relevant here: what are they used for? >> >> As coded above, these are memory mapped areas taht (1) overlap, >> and (2) will be demand paged automatically to/from the disk >> as they are accessed/modified. This *will* conflict with any SG_IO >> operations happening at the same time on the same device. .. > The purpose of using two memory mapped areas is to meet our > requirement that certain data patterns for writing need to be kept > across commands. For instance, if one buffer is used for both reads > and writes, then this buffer will need to be re-populated with certain > write data after each read command, which would be very costly for > write-read mixed type of ops. This separate R/W buffer setting also > facilitates data comparison. > > These buffers are not used at the same time (one will be used only > after the command on the other is completed). 
My application is the
> only program accessing disk using sg/libata and the rest of the
> programs run from ramdisk. Also, each buffer is only about 0.5MB and
> we have 64MB RAM on the target board.
> With this setup, these two buffers should be pretty much independent
> and free from block layer/file system, correct?
..

No. Those "buffers" as coded above are actually mmap'ed representations
of portions of the device (disk drive). So any write into one of those
buffers will trigger disk writes, and just accessing ("read") the buffers
may trigger disk reads.

So what could be happening here, is when you trigger manual disk accesses
via SG_IO, that result in data being copied into those "buffers", the kernel
then automatically schedules disk writes to update the on-disk copies of
those mmap'd regions.

What you probably intended to do instead, was to use mmap to just allocate
some page-aligned RAM, not to actually mmap any on-disk data. Right?

Here's how that's done:

    read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
                             MAP_SHARED|MAP_ANONYMOUS, -1, 0);

Cheers
* Re: Process Scheduling Issue using sg/libata 2007-11-17 19:55 ` Mark Lord @ 2007-11-18 6:48 ` Fajun Chen 2007-11-18 14:32 ` Mark Lord 0 siblings, 1 reply; 19+ messages in thread From: Fajun Chen @ 2007-11-18 6:48 UTC (permalink / raw) To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo On 11/17/07, Mark Lord <liml@rtr.ca> wrote: > Fajun Chen wrote: > > On 11/17/07, Mark Lord <liml@rtr.ca> wrote: > >> Fajun Chen wrote: > >>> On 11/16/07, Mark Lord <liml@rtr.ca> wrote: > >>>> Fajun Chen wrote: > .. > >>> This problem also happens with R/W DMA ops. Below are simplified code snippets: > >>> // Open one sg device for read > >>> if ((sg_fd = open(dev_name, O_RDWR))<0) > >>> { > >>> ... > >>> } > >>> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, > >>> MAP_SHARED, sg_fd, 0); > >>> > >>> // Open the same sg device for write > >>> if ((sg_fd_wr = open(dev_name, O_RDWR))<0) > >>> { > >>> ... > >>> } > >>> write_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, > >>> MAP_SHARED, sg_fd_wr, 0); > >> .. > >> > >> Mmmm.. what is the purpose of those two mmap'd areas ? > >> I think this is important and relevant here: what are they used for? > >> > >> As coded above, these are memory mapped areas taht (1) overlap, > >> and (2) will be demand paged automatically to/from the disk > >> as they are accessed/modified. This *will* conflict with any SG_IO > >> operations happening at the same time on the same device. > .. > > The purpose of using two memory mapped areas is to meet our > > requirement that certain data patterns for writing need to be kept > > across commands. For instance, if one buffer is used for both reads > > and writes, then this buffer will need to be re-populated with certain > > write data after each read command, which would be very costly for > > write-read mixed type of ops. This separate R/W buffer setting also > > facilitates data comparison. 
> >
> > These buffers are not used at the same time (one will be used only
> > after the command on the other is completed). My application is the
> > only program accessing disk using sg/libata and the rest of the
> > programs run from ramdisk. Also, each buffer is only about 0.5MB and
> > we have 64MB RAM on the target board.
> > With this setup, these two buffers should be pretty much independent
> > and free from block layer/file system, correct?
> ..
>
> No. Those "buffers" as coded above are actually mmap'ed representations
> of portions of the device (disk drive). So any write into one of those
> buffers will trigger disk writes, and just accessing ("read") the buffers
> may trigger disk reads.
>
> So what could be happening here, is when you trigger manual disk accesses
> via SG_IO, that result in data being copied into those "buffers", the kernel
> then automatically schedules disk writes to update the on-disk copies of
> those mmap'd regions.
>
> What you probably intended to do instead, was to use mmap to just allocate
> some page-aligned RAM, not to actually mmap any on-disk data. Right?
>
> Here's how that's done:
>
> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE,
> MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>

What I intended to do is to write data to disc or read data from disc via SG_IO, as requested by my user-space application. I don't want any automatically scheduled kernel task to sync data with disc.

I've experimented with memory mapping using MAP_ANONYMOUS as you suggested. The good news is that it does reduce the CPU load, and my system is much more responsive with the change. The bad news is that the data read back from disc (PIO or DMA read) seems to be invisible to the user-space application. For instance, the read buffer is all zeros after an Identify Device command. Is this an expected side effect of the MAP_ANONYMOUS option?

Thanks,
Fajun
* Re: Process Scheduling Issue using sg/libata 2007-11-18 6:48 ` Fajun Chen @ 2007-11-18 14:32 ` Mark Lord 2007-11-18 19:14 ` Fajun Chen 0 siblings, 1 reply; 19+ messages in thread From: Mark Lord @ 2007-11-18 14:32 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo [-- Attachment #1: Type: text/plain, Size: 1418 bytes --] Fajun Chen wrote: > On 11/17/07, Mark Lord <liml@rtr.ca> wrote: .. >> What you probably intended to do instead, was to use mmap to just allocate >> some page-aligned RAM, not to actually mmap'd any on-disk data. Right? >> >> Here's how that's done: >> >> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, >> MAP_SHARED|MAP_ANONYMOUS, -1, 0); >> > What I intended to do is to write data into disc or read data from > disc via SG_IO as requested by my user-space application. I don't want > any automatically scheduled kernel task to sync data with disc. .. Right. Then you definitely do NOT want to mmap your device, because that's exactly what would otherwise happen, by design! > I've experimented with memory mapping using MAP_ANONYMOUS as you > suggested, the good news is that it does free up the cpu load and my > system is much more responsive with the change. .. Yes, that's what we expected to see. > The bad news is that > the data read back from disc (PIO or DMA read) seems to be invisible > to user-space application. For instance, read buffer is all zeros > after Identify Device command. Is this expected side effect of > MAP_ANONYMOUS option? .. No, that would be a side effect of some other bug in the code. Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE commands, using a mmap() buffer to receive the data. Cheers [-- Attachment #2: sg_identify.c --] [-- Type: text/x-csrc, Size: 3678 bytes --] /* * This code is copyright 2007 by Mark Lord, * and is made available to all under the terms * of the GNU General Public License v2. 
*/ #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <linux/fs.h> #include <linux/hdreg.h> #include <scsi/scsi.h> #include <scsi/sg.h> #include <sys/mman.h> typedef unsigned long long u64; enum { ATA_CMD_PIO_IDENTIFY = 0xec, ATA_CMD_PIO_PIDENTIFY = 0xa1, /* normal sector size (bytes) for PIO/DMA */ ATA_SECT_SIZE = 512, ATA_16 = 0x85, ATA_16_LEN = 16, ATA_DEV_REG_LBA = (1 << 6), ATA_LBA48 = 1, /* data transfer protocols; only basic PIO and DMA actually work */ ATA_PROTO_NON_DATA = ( 3 << 1), ATA_PROTO_PIO_IN = ( 4 << 1), ATA_PROTO_PIO_OUT = ( 5 << 1), ATA_PROTO_DMA = ( 6 << 1), ATA_PROTO_UDMA_IN = (11 << 1), /* unsupported */ ATA_PROTO_UDMA_OUT = (12 << 1), /* unsupported */ }; /* * Taskfile layout for ATA_16 cdb (LBA28/LBA48): * * cdb[ 4] = feature * cdb[ 6] = nsect * cdb[ 8] = lbal * cdb[10] = lbam * cdb[12] = lbah * cdb[13] = device * cdb[14] = command * * "high order byte" (hob) fields for LBA48 commands: * * cdb[ 3] = hob_feature * cdb[ 5] = hob_nsect * cdb[ 7] = hob_lbal * cdb[ 9] = hob_lbam * cdb[11] = hob_lbah * * dxfer_direction choices: * * SG_DXFER_TO_DEV (writing to drive) * SG_DXFER_FROM_DEV (reading from drive) * SG_DXFER_NONE (non-data commands) */ static int sg_issue (int fd, unsigned char ata_op, void *buf) { unsigned char cdb[ATA_16_LEN] = { ATA_16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }; unsigned char sense[32]; unsigned int nsects = 1; struct sg_io_hdr hdr; cdb[ 1] = ATA_PROTO_PIO_IN; cdb[ 6] = nsects; cdb[14] = ata_op; memset(&hdr, 0, sizeof(struct sg_io_hdr)); hdr.interface_id = 'S'; hdr.cmd_len = ATA_16_LEN; hdr.mx_sb_len = sizeof(sense); hdr.dxfer_direction = SG_DXFER_FROM_DEV; hdr.dxfer_len = nsects * ATA_SECT_SIZE; hdr.dxferp = buf; hdr.cmdp = cdb; hdr.sbp = sense; hdr.timeout = 5000; /* milliseconds */ memset(sense, 0, sizeof(sense)); if (ioctl(fd, SG_IO, &hdr) < 0) { perror("ioctl(SG_IO)"); 
return (-1); } if (hdr.status == 0 && hdr.host_status == 0 && hdr.driver_status == 0) return 0; /* success */ if (hdr.status > 0) { unsigned char *d = sense + 8; /* SCSI status is non-zero */ fprintf(stderr, "SG_IO error: SCSI sense=0x%x/%02x/%02x, ATA=0x%02x/%02x\n", sense[1] & 0xf, sense[2], sense[3], d[13], d[3]); return -1; } /* some other error we don't know about yet */ fprintf(stderr, "SG_IO returned: SCSI status=0x%x, host_status=0x%x, driver_status=0x%x\n", hdr.status, hdr.host_status, hdr.driver_status); return -1; } int main (int argc, char *argv[]) { const char *devpath; int i, rc, fd; #if 0 unsigned short id[ATA_SECT_SIZE / 2]; memset(id, 0, sizeof(id)); #else unsigned short *id; id = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0); if (id == MAP_FAILED) { perror("mmap"); exit(1); } #endif if (argc != 2) { fprintf(stderr, "%s: bad/missing parm: expected <devpath>\n", argv[0]); exit(1); } devpath = argv[1]; fd = open(devpath, O_RDWR|O_NONBLOCK); if (fd == -1) { perror(devpath); exit(1); } rc = sg_issue(fd, ATA_CMD_PIO_IDENTIFY, id); if (rc != 0) rc = sg_issue(fd, ATA_CMD_PIO_PIDENTIFY, id); if (rc == 0) { unsigned short *d = id; for (i = 0; i < (256/8); ++i) { printf("%04x %04x %04x %04x %04x %04x %04x %04x\n", d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7]); d += 8; } exit(0); } exit(1); } ^ permalink raw reply [flat|nested] 19+ messages in thread
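The taskfile layout documented in the comment above (cdb[4] = feature, cdb[6] = nsect, and so on, plus the "hob" bytes for LBA48) can be sketched as a small CDB builder. This is only an illustration under stated assumptions: the helper name ata16_fill_lba48() is hypothetical and not part of the posted program, the enum values are copied from the program above, and the split of the 48-bit LBA across the lbal/lbam/lbah and hob_* bytes follows the usual ATA convention (low three bytes in the current registers, high three bytes in the hob registers). Other bytes of the CDB are left at zero, just as the posted program leaves them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values copied from the program above. */
enum {
    ATA_16           = 0x85,
    ATA_16_LEN       = 16,
    ATA_DEV_REG_LBA  = (1 << 6),
    ATA_LBA48        = 1,
    ATA_PROTO_PIO_IN = (4 << 1),
};

/*
 * Hypothetical helper (not in the original program): fill an ATA_16 cdb
 * for an LBA48 command using the taskfile layout from the comment above.
 */
void ata16_fill_lba48(unsigned char cdb[ATA_16_LEN], unsigned char proto,
                      unsigned char ata_op, uint64_t lba, unsigned int nsect)
{
    assert(lba < (1ULL << 48));          /* LBA48 carries 48 address bits */
    assert(nsect >= 1 && nsect <= 0xffff);

    memset(cdb, 0, ATA_16_LEN);
    cdb[ 0] = ATA_16;
    cdb[ 1] = proto | ATA_LBA48;         /* protocol plus the LBA48 bit */
    cdb[ 5] = (nsect >> 8) & 0xff;       /* hob_nsect */
    cdb[ 6] = nsect & 0xff;              /* nsect     */
    cdb[ 7] = (lba >> 24) & 0xff;        /* hob_lbal  */
    cdb[ 8] = lba & 0xff;                /* lbal      */
    cdb[ 9] = (lba >> 32) & 0xff;        /* hob_lbam  */
    cdb[10] = (lba >> 8) & 0xff;         /* lbam      */
    cdb[11] = (lba >> 40) & 0xff;        /* hob_lbah  */
    cdb[12] = (lba >> 16) & 0xff;        /* lbah      */
    cdb[13] = ATA_DEV_REG_LBA;           /* device    */
    cdb[14] = ata_op;                    /* command   */
}
```

The resulting buffer would be handed to SG_IO through hdr.cmdp exactly as sg_issue() does above; whether a given libata version accepts a particular protocol value is not something this sketch can confirm.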
* Re: Process Scheduling Issue using sg/libata 2007-11-18 14:32 ` Mark Lord @ 2007-11-18 19:14 ` Fajun Chen 2007-11-18 19:54 ` Mark Lord 0 siblings, 1 reply; 19+ messages in thread From: Fajun Chen @ 2007-11-18 19:14 UTC (permalink / raw) To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo [-- Attachment #1: Type: text/plain, Size: 2312 bytes --] On 11/18/07, Mark Lord <liml@rtr.ca> wrote: > Fajun Chen wrote: > > On 11/17/07, Mark Lord <liml@rtr.ca> wrote: > .. > >> What you probably intended to do instead, was to use mmap to just allocate > >> some page-aligned RAM, not to actually mmap'd any on-disk data. Right? > >> > >> Here's how that's done: > >> > >> read_buffer = (U8 *)mmap(NULL, buf_sz, PROT_READ | PROT_WRITE, > >> MAP_SHARED|MAP_ANONYMOUS, -1, 0); > >> > > What I intended to do is to write data into disc or read data from > > disc via SG_IO as requested by my user-space application. I don't want > > any automatically scheduled kernel task to sync data with disc. > .. > > Right. Then you definitely do NOT want to mmap your device, > because that's exactly what would otherwise happen, by design! > > > > I've experimented with memory mapping using MAP_ANONYMOUS as you > > suggested, the good news is that it does free up the cpu load and my > > system is much more responsive with the change. > .. > > Yes, that's what we expected to see. > > > > The bad news is that > > the data read back from disc (PIO or DMA read) seems to be invisible > > to user-space application. For instance, read buffer is all zeros > > after Identify Device command. Is this expected side effect of > > MAP_ANONYMOUS option? > .. > > No, that would be a side effect of some other bug in the code. > > Here (attached) is a working program that performs (PACKET)IDENTIFY DEVICE > commands, using a mmap() buffer to receive the data. > I verified your program works in my system and my application works as well if changed accordingly. 
However, this change (indirect IO in sg term) may come at a performance cost for IO intensive applications since it does NOT utilize mmaped buffer managed by sg driver. Please see relevant sg document below: http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330 http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag in SG_IO. Please see source code attached. I also noticed that MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not be desirable as you pointed out in previous emails. So this brings up an interesting sg usage issue: can we use MAP_ANONYMOUS with SG_FLAG_MMAP_IO flag in SG_IO? Thanks, Fajun [-- Attachment #2: sg_rbuf.c --] [-- Type: application/octet-stream, Size: 12049 bytes --] #define _XOPEN_SOURCE 500 #define _GNU_SOURCE #include <unistd.h> #include <signal.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <sys/ioctl.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <sys/time.h> #include <linux/fs.h> #include "sg_include.h" #include "sg_lib.h" /* Test code for D. Gilbert's extensions to the Linux OS SCSI generic ("sg") device driver. * Copyright (C) 1999-2004 D. Gilbert * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2, or (at your option) * any later version. This program uses the SCSI command READ BUFFER on the given sg device, first to find out how big it is and then to read that buffer. The '-q' option skips the data transfer from the kernel DMA buffers to the user space. The '-b=num' option allows the buffer size (in KiB) to be specified (default is to use the number obtained from READ BUFFER (descriptor) SCSI command). The '-s=num' option allows the total size of the transfer to be set (in megabytes, the default is 200 MiB). 
The '-d' option requests direct io (and is overridden by '-q'). The '-m' option request mmap-ed IO (and overrides the '-q' and '-d' options if they are also given). The ability to time transfers internally (based on gettimeofday()) has been added with the '-t' option. */ #define RB_MODE_DESC 3 #define RB_MODE_DATA 2 #define RB_DESC_LEN 4 #define RB_MIB_TO_READ 200 #define RB_OPCODE 0x3C #define RB_CMD_LEN 10 /* #define SG_DEBUG */ #ifndef SG_FLAG_MMAP_IO #define SG_FLAG_MMAP_IO 4 #endif #define ME "sg_rbuf: " static char * version_str = "4.77 20041011"; static void usage() { printf("Usage: sg_rbuf [-b=num] [[-q] | [-d] | [-m]] [-s=num] [-t] " "[-v] [-V]\n <generic_device>\n"); printf(" where -b=num num is buffer size to use (in KiB)\n"); printf(" -d requests dio ('-q' overrides it)\n"); printf(" -m requests mmap-ed IO (overrides -q, -d)\n"); printf(" -q quick, don't xfer to user space\n"); printf(" -s=num num is total size to read (in MiB)\n"); printf(" default total size is 200 MiB\n"); printf(" max total size is 4000 MiB\n"); printf(" -t time the data transfer\n"); printf(" -v increase verbosity (more debug)\n"); printf(" -V print version string then exit\n"); } int main(int argc, char * argv[]) { int sg_fd, res, j, m; unsigned int k, num; unsigned char rbCmdBlk [RB_CMD_LEN]; unsigned char * rbBuff = NULL; void * rawp = NULL; unsigned char sense_buffer[32]; int buf_capacity = 0; int do_quick = 0; int do_dio = 0; int do_mmap = 0; int do_time = 0; int verbose = 0; int buf_size = 0; unsigned int total_size_mib = RB_MIB_TO_READ; char * file_name = 0; size_t psz = getpagesize(); int dio_incomplete = 0; struct sg_io_hdr io_hdr; struct timeval start_tm, end_tm; #ifdef SG_DEBUG int clear = 1; #endif for (j = 1; j < argc; ++j) { if (0 == strncmp("-b=", argv[j], 3)) { m = 3; num = sscanf(argv[j] + m, "%d", &buf_size); if ((1 != num) || (buf_size <= 0)) { printf("Couldn't decode number after '-b' switch\n"); file_name = 0; break; } buf_size *= 1024; } else if (0 == 
strncmp("-s=", argv[j], 3)) { m = 3; num = sscanf(argv[j] + m, "%u", &total_size_mib); if (1 != num) { printf("Couldn't decode number after '-s' switch\n"); file_name = 0; break; } } else if (0 == strcmp("-q", argv[j])) do_quick = 1; else if (0 == strcmp("-d", argv[j])) do_dio = 1; else if (0 == strcmp("-m", argv[j])) do_mmap = 1; else if (0 == strcmp("-t", argv[j])) do_time = 1; else if (0 == strcmp("-v", argv[j])) ++verbose; else if (0 == strcmp("-V", argv[j])) { fprintf(stderr, ME "version: %s\n", version_str); return 0; } else if (*argv[j] == '-') { printf("Unrecognized switch: %s\n", argv[j]); file_name = 0; break; } else file_name = argv[j]; } if (0 == file_name) { usage(); return 1; } sg_fd = open(file_name, O_RDONLY); if (sg_fd < 0) { perror(ME "open error"); return 1; } /* Don't worry, being very careful not to write to a none-sg file ... */ if (do_mmap) { do_dio = 0; do_quick = 0; } if (NULL == (rawp = malloc(512))) { printf(ME "out of memory (query)\n"); return 1; } rbBuff = rawp; memset(rbCmdBlk, 0, RB_CMD_LEN); rbCmdBlk[0] = RB_OPCODE; rbCmdBlk[1] = RB_MODE_DESC; rbCmdBlk[8] = RB_DESC_LEN; memset(&io_hdr, 0, sizeof(struct sg_io_hdr)); io_hdr.interface_id = 'S'; io_hdr.cmd_len = sizeof(rbCmdBlk); io_hdr.mx_sb_len = sizeof(sense_buffer); io_hdr.dxfer_direction = SG_DXFER_FROM_DEV; io_hdr.dxfer_len = RB_DESC_LEN; io_hdr.dxferp = rbBuff; io_hdr.cmdp = rbCmdBlk; io_hdr.sbp = sense_buffer; io_hdr.timeout = 60000; /* 60000 millisecs == 60 seconds */ /* do normal IO to find RB size (not dio or mmap-ed at this stage) */ if (ioctl(sg_fd, SG_IO, &io_hdr) < 0) { perror(ME "SG_IO READ BUFFER descriptor error"); if (rawp) free(rawp); return 1; } /* now for the error processing */ switch (sg_err_category3(&io_hdr)) { case SG_LIB_CAT_CLEAN: break; case SG_LIB_CAT_RECOVERED: printf("Recovered error on READ BUFFER descriptor, continuing\n"); break; default: /* won't bother decoding other categories */ sg_chk_n_print3("READ BUFFER descriptor error", &io_hdr); if (rawp) 
free(rawp); return 1; } buf_capacity = ((rbBuff[1] << 16) | (rbBuff[2] << 8) | rbBuff[3]); printf("READ BUFFER reports: buffer capacity=%d, offset boundary=%d\n", buf_capacity, (int)rbBuff[0]); if (0 == buf_size) buf_size = buf_capacity; else if (buf_size > buf_capacity) { printf("Requested buffer size=%d exceeds reported capacity=%d\n", buf_size, buf_capacity); if (rawp) free(rawp); return 1; } if (rawp) { free(rawp); rawp = NULL; } if (! do_dio) { k = buf_size; if (do_mmap && (0 != (k % psz))) k = ((k / psz) + 1) * psz; /* round up to page size */ res = ioctl(sg_fd, SG_SET_RESERVED_SIZE, &k); if (res < 0) perror(ME "SG_SET_RESERVED_SIZE error"); } if (do_mmap) { rbBuff = mmap(NULL, buf_size, PROT_READ, MAP_SHARED, sg_fd, 0); if (MAP_FAILED == rbBuff) { if (ENOMEM == errno) printf(ME "mmap() out of memory, try a smaller " "buffer size than %d KiB\n", buf_size / 1024); else perror(ME "error using mmap()"); return 1; } } else { /* non mmap-ed IO */ rawp = malloc(buf_size + (do_dio ? psz : 0)); if (NULL == rawp) { printf(ME "out of memory (data)\n"); return 1; } if (do_dio) /* align to page boundary */ rbBuff= (unsigned char *)(((unsigned long)rawp + psz - 1) & (~(psz - 1))); else rbBuff = rawp; } num = (total_size_mib * 1024U * 1024U) / (unsigned int)buf_size; if (do_time) { start_tm.tv_sec = 0; start_tm.tv_usec = 0; gettimeofday(&start_tm, NULL); } /* main data reading loop */ for (k = 0; k < num; ++k) { memset(rbCmdBlk, 0, RB_CMD_LEN); rbCmdBlk[0] = RB_OPCODE; rbCmdBlk[1] = RB_MODE_DATA; rbCmdBlk[6] = 0xff & (buf_size >> 16); rbCmdBlk[7] = 0xff & (buf_size >> 8); rbCmdBlk[8] = 0xff & buf_size; #ifdef SG_DEBUG memset(rbBuff, 0, buf_size); #endif memset(&io_hdr, 0, sizeof(struct sg_io_hdr)); io_hdr.interface_id = 'S'; io_hdr.cmd_len = sizeof(rbCmdBlk); io_hdr.mx_sb_len = sizeof(sense_buffer); io_hdr.dxfer_direction = SG_DXFER_FROM_DEV; io_hdr.dxfer_len = buf_size; if (! 
do_mmap) io_hdr.dxferp = rbBuff; io_hdr.cmdp = rbCmdBlk; io_hdr.sbp = sense_buffer; io_hdr.timeout = 20000; /* 20000 millisecs == 20 seconds */ io_hdr.pack_id = k; if (do_mmap) io_hdr.flags |= SG_FLAG_MMAP_IO; else if (do_dio) io_hdr.flags |= SG_FLAG_DIRECT_IO; else if (do_quick) io_hdr.flags |= SG_FLAG_NO_DXFER; if (ioctl(sg_fd, SG_IO, &io_hdr) < 0) { if (ENOMEM == errno) printf(ME "SG_IO data; out of memory, try a smaller " "buffer size than %d KiB\n", buf_size / 1024); else perror(ME "SG_IO READ BUFFER data error"); if (rawp) free(rawp); return 1; } /* now for the error processing */ switch (sg_err_category3(&io_hdr)) { case SG_LIB_CAT_CLEAN: break; case SG_LIB_CAT_RECOVERED: printf("Recovered error on READ BUFFER data, continuing\n"); break; default: /* won't bother decoding other categories */ sg_chk_n_print3("READ BUFFER data error", &io_hdr); if (rawp) free(rawp); return 1; } if (do_dio && ((io_hdr.info & SG_INFO_DIRECT_IO_MASK) != SG_INFO_DIRECT_IO)) dio_incomplete = 1; /* flag that dio not done (completely) */ #ifdef SG_DEBUG if (clear) { for (j = 0; j < buf_size; ++j) { if (rbBuff[j] != 0) { clear = 0; break; } } } #endif } if ((do_time) && (start_tm.tv_sec || start_tm.tv_usec)) { struct timeval res_tm; double a, b; gettimeofday(&end_tm, NULL); res_tm.tv_sec = end_tm.tv_sec - start_tm.tv_sec; res_tm.tv_usec = end_tm.tv_usec - start_tm.tv_usec; if (res_tm.tv_usec < 0) { --res_tm.tv_sec; res_tm.tv_usec += 1000000; } a = res_tm.tv_sec; a += (0.000001 * res_tm.tv_usec); b = (double)buf_size * num; printf("time to read data from buffer was %d.%06d secs", (int)res_tm.tv_sec, (int)res_tm.tv_usec); if ((a > 0.00001) && (b > 511)) printf(", %.2f MB/sec\n", b / (a * 1000000.0)); else printf("\n"); } if (dio_incomplete) printf(">> direct IO requested but not done\n"); printf("Read %u MiB (actual %u MiB, %u bytes), buffer size=%d KiB\n", total_size_mib, (num * buf_size) / 1048576, num * buf_size, buf_size / 1024); if (rawp) free(rawp); res = close(sg_fd); if (res < 
0) { perror(ME "close error"); return 1; } #ifdef SG_DEBUG if (clear) printf("read buffer always zero\n"); else printf("read buffer non-zero\n"); #endif return 0; } ^ permalink raw reply [flat|nested] 19+ messages in thread
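The page rounding that sg_rbuf.c applies to the requested size before calling SG_SET_RESERVED_SIZE (when '-m' is given) can be isolated as below. This is a minimal sketch; the function name round_up_to_page() is hypothetical and does not appear in sg3_utils, and the arithmetic simply mirrors the "round up to page size" line in the attachment.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: round len up to the next multiple of psz,
 * mirroring the expression used in sg_rbuf.c:
 *     if (0 != (k % psz)) k = ((k / psz) + 1) * psz;
 */
size_t round_up_to_page(size_t len, size_t psz)
{
    assert(psz > 0);                     /* a zero page size makes no sense */
    if (len % psz != 0)
        len = ((len / psz) + 1) * psz;
    return len;
}
```

With a 4096-byte page, a single 512-byte block rounds up to one full page, so a transfer of one block per command would still reserve and map a whole page.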
* Re: Process Scheduling Issue using sg/libata 2007-11-18 19:14 ` Fajun Chen @ 2007-11-18 19:54 ` Mark Lord 2007-11-18 22:29 ` Fajun Chen 0 siblings, 1 reply; 19+ messages in thread From: Mark Lord @ 2007-11-18 19:54 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo Fajun Chen wrote: >.. > I verified your program works in my system and my application works as > well if changed accordingly. However, this change (indirect IO in sg > term) may come at a performance cost for IO intensive applications > since it does NOT utilize mmaped buffer managed by sg driver. Please > see relevant sg document below: > http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330 > http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio > As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag > in SG_IO. Please see source code attached. I also noticed that > MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not > be desirable as you pointed out in previous emails. So this brings up > an interesting sg usage issue: can we use MAP_ANONYMOUS with > SG_FLAG_MMAP_IO flag in SG_IO? .. The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd* devices. I don't know which kind you were trying to use, since you still have not provided your source code for examination. If you are using /dev/sg*, then you should be able to get your original mmap() code to work. But the behaviour described thus far seems to indicate that your secret program must have been using /dev/sd* instead. Cheers ^ permalink raw reply [flat|nested] 19+ messages in thread
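The /dev/sg* versus /dev/sd* distinction made above can be checked from user space before attempting mmap-ed IO. The following is a rough sketch, not code from the thread: the names is_sg_rdev() and path_is_sg() are hypothetical, and SG_CHAR_MAJOR is a local constant assumed to match SCSI_GENERIC_MAJOR (21) from <linux/major.h>, which is the value sgm_dd's dd_filetype() compares against.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE                  /* for S_IFCHR and friends on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

/* Assumption: same value as SCSI_GENERIC_MAJOR in <linux/major.h>. */
#define SG_CHAR_MAJOR 21

/* Hypothetical helper: true for a character device with the sg major. */
int is_sg_rdev(mode_t mode, dev_t rdev)
{
    return S_ISCHR(mode) && major(rdev) == SG_CHAR_MAJOR;
}

/* Hypothetical helper: stat a path and apply the check above. */
int path_is_sg(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) < 0)
        return 0;                        /* unreadable paths are "not sg" */
    return is_sg_rdev(st.st_mode, st.st_rdev);
}
```

A block device such as /dev/sda has S_ISBLK set and a different major, so it fails this check; a program could refuse to set SG_FLAG_MMAP_IO in that case.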
* Re: Process Scheduling Issue using sg/libata 2007-11-18 19:54 ` Mark Lord @ 2007-11-18 22:29 ` Fajun Chen 2007-11-18 23:07 ` Mark Lord 0 siblings, 1 reply; 19+ messages in thread From: Fajun Chen @ 2007-11-18 22:29 UTC (permalink / raw) To: Mark Lord; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo [-- Attachment #1: Type: text/plain, Size: 2164 bytes --] On 11/18/07, Mark Lord <liml@rtr.ca> wrote: > Fajun Chen wrote: > >.. > > I verified your program works in my system and my application works as > > well if changed accordingly. However, this change (indirect IO in sg > > term) may come at a performance cost for IO intensive applications > > since it does NOT utilize mmaped buffer managed by sg driver. Please > > see relevant sg document below: > > http://sg.torque.net/sg/p/sg_v3_ho.html#id2495330 > > http://sg.torque.net/sg/p/sg_v3_ho.html#dmmio > > As an example, sg_rbuf.c in sg3_util package uses SG_FLAG_MMAP_IO flag > > in SG_IO. Please see source code attached. I also noticed that > > MAP_ANONYMOUS is NOT used in mmap() call in sg_rbuf.c, which may not > > be desirable as you pointed out in previous emails. So this brings up > > an interesting sg usage issue: can we use MAP_ANONYMOUS with > > SG_FLAG_MMAP_IO flag in SG_IO? > .. > > The SG_FLAG_MMAP works only with /dev/sg* devices, not /dev/sd* devices. > I don't know which kind you were trying to use, since you still have > not provided your source code for examination. > > If you are using /dev/sg*, then you should be able to get your original mmap() > code to work. But the behaviour described thus far seems to indicate that > your secret program must have been using /dev/sd* instead. > As a matter of fact, I'm using /dev/sg*. Due to the size of my test application, I have not been able to compress it into a small and publishable form. However, this issue can be easily reproduced on my ARM XScale target using sg3_util code as follows: 1. 
Run printtime.c attached, which prints message to console in a loop.
2. Run sgm_dd (part of sg3_util package, source code attached) on the same system as follows:
>sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1
The print task can be delayed for as many as 25 seconds. Surprisingly, I can't reproduce the problem in an i386 test system with a more powerful processor.

Some clarification to MAP_ANONYMOUS option in mmap(). After fixing a bug and more testing, this option seems to make no difference to cpu load. Sorry about previous report. Back to the drawing board now :-)

Thanks,
Fajun

[-- Attachment #2: printtime.c --]
[-- Type: application/octet-stream, Size: 551 bytes --]

#include <stdio.h>
#include <time.h>
#include <string.h>

void TimeDateStamp(char *current_time)
{
    time_t current_time_seconds; // current time in seconds
    int size;

    time(&current_time_seconds); // get the current time
    sprintf(current_time,"%s",ctime(&current_time_seconds)); // convert to time-date stamp
    size = strlen(current_time);
    current_time[size-1] = '\0';
}

int main()
{
    char ts[80];
    int i = 0;

    while (++i) {
        TimeDateStamp(ts);
        printf("[%s]In loop %d\n", ts, i);
    };
    return 0;
}

[-- Attachment #3: sgm_dd.c --]
[-- Type: application/octet-stream, Size: 35309 bytes --]

#define _XOPEN_SOURCE 500 #define _GNU_SOURCE #include <unistd.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <signal.h> #include <ctype.h> #include <errno.h> #include <limits.h> #include <sys/ioctl.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/sysmacros.h> #include <sys/mman.h> #include <sys/time.h> #include <linux/major.h> #include <linux/fs.h> #include "sg_include.h" #include "sg_lib.h" #include "sg_cmds.h" #include "llseek.h" /* A utility program for copying files. Specialised for "files" that * represent devices that understand the SCSI command set. * * Copyright (C) 1999 - 2004 D. Gilbert and P.
Allworth * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2, or (at your option) * any later version. This program is a specialisation of the Unix "dd" command in which either the input or the output file is a scsi generic device or a raw device. The block size ('bs') is assumed to be 512 if not given. This program complains if 'ibs' or 'obs' are given with a value that differs from 'bs' (or the default 512). If 'if' is not given or 'if=-' then stdin is assumed. If 'of' is not given or 'of=-' then stdout assumed. Multipliers: 'c','C' *1 'b','B' *512 'k' *1024 'K' *1000 'm' *(1024^2) 'M' *(1000^2) 'g' *(1024^3) 'G' *(1000^3) 't' *(1024^4) 'T' *(1000^4) A non-standard argument "bpt" (blocks per transfer) is added to control the maximum number of blocks in each transfer. The default value is 128. For example if "bs=512" and "bpt=32" then a maximum of 32 blocks (16 KiB in this case) is transferred to or from the sg device in a single SCSI command. This version uses memory-mapped IO (i.e. mmap() call from the user space) to speed transfers. If both sides of copy are sg devices then only the read side will be mmap-ed, while the write side will use normal IO. This version is designed for the linux kernel 2.4 and 2.6 series. 
*/ static char * version_str = "1.16 20041102"; #define DEF_BLOCK_SIZE 512 #define DEF_BLOCKS_PER_TRANSFER 128 #define DEF_SCSI_CDBSZ 10 #define MAX_SCSI_CDBSZ 16 #define ME "sgm_dd: " /* #define SG_DEBUG */ #ifndef SG_FLAG_MMAP_IO #define SG_FLAG_MMAP_IO 4 #endif #define SENSE_BUFF_LEN 32 /* Arbitrary, could be larger */ #define READ_CAP_REPLY_LEN 8 #define RCAP16_REPLY_LEN 32 #ifndef SERVICE_ACTION_IN #define SERVICE_ACTION_IN 0x9e #endif #ifndef SAI_READ_CAPACITY_16 #define SAI_READ_CAPACITY_16 0x10 #endif #define DEF_TIMEOUT 60000 /* 60,000 millisecs == 60 seconds */ #ifndef RAW_MAJOR #define RAW_MAJOR 255 /*unlikey value */ #endif #define FT_OTHER 1 /* filetype other than one of following */ #define FT_SG 2 /* filetype is sg char device */ #define FT_RAW 4 /* filetype is raw char device */ #define FT_DEV_NULL 8 /* either "/dev/null" or "." as filename */ #define FT_ST 16 /* filetype is st char device (tape) */ #define FT_BLOCK 32 /* filetype is a block device */ #define DEV_NULL_MINOR_NUM 3 static int sum_of_resids = 0; static long long dd_count = -1; static long long in_full = 0; static int in_partial = 0; static long long out_full = 0; static int out_partial = 0; static const char * proc_allow_dio = "/proc/scsi/sg/allow_dio"; static void install_handler (int sig_num, void (*sig_handler) (int sig)) { struct sigaction sigact; sigaction (sig_num, NULL, &sigact); if (sigact.sa_handler != SIG_IGN) { sigact.sa_handler = sig_handler; sigemptyset (&sigact.sa_mask); sigact.sa_flags = 0; sigaction (sig_num, &sigact, NULL); } } void print_stats() { if (0 != dd_count) fprintf(stderr, " remaining block count=%lld\n", dd_count); fprintf(stderr, "%lld+%d records in\n", in_full - in_partial, in_partial); fprintf(stderr, "%lld+%d records out\n", out_full - out_partial, out_partial); } static void interrupt_handler(int sig) { struct sigaction sigact; sigact.sa_handler = SIG_DFL; sigemptyset (&sigact.sa_mask); sigact.sa_flags = 0; sigaction (sig, &sigact, NULL); 
fprintf(stderr, "Interrupted by signal,"); print_stats (); kill (getpid (), sig); } static void siginfo_handler(int sig) { sig = sig; /* dummy to stop -W warning messages */ fprintf(stderr, "Progress report, continuing ...\n"); print_stats (); } int dd_filetype(const char * filename) { struct stat st; size_t len = strlen(filename); if ((1 == len) && ('.' == filename[0])) return FT_DEV_NULL; if (stat(filename, &st) < 0) return FT_OTHER; if (S_ISCHR(st.st_mode)) { if ((MEM_MAJOR == major(st.st_rdev)) && (DEV_NULL_MINOR_NUM == minor(st.st_rdev))) return FT_DEV_NULL; if (RAW_MAJOR == major(st.st_rdev)) return FT_RAW; if (SCSI_GENERIC_MAJOR == major(st.st_rdev)) return FT_SG; if (SCSI_TAPE_MAJOR == major(st.st_rdev)) return FT_ST; } else if (S_ISBLK(st.st_mode)) return FT_BLOCK; return FT_OTHER; } void usage() { fprintf(stderr, "Usage: " "sgm_dd [if=<infile>] [skip=<n>] [of=<ofile>] [seek=<n>]\n" " [bs=<num>] [bpt=<num>] [count=<n>] [time=<n>]\n" " [cdbsz=6|10|12|16] [fua=0|1|2|3] [sync=0|1]\n" " [dio=0|1] [--version]\n" " 'bs' must be device block size (default 512)\n" " 'bpt' is blocks_per_transfer (default is 128)\n" " 'time' 0->no timing(def), 1->time plus calculate throughput\n" " 'fua' force unit access: 0->don't(def), 1->of, 2->if, 3->of+if\n" " 'sync' 0->no sync(def), 1->SYNCHRONIZE CACHE after xfer\n" " 'cdbsz' size of SCSI READ or WRITE command (default is 10)\n" " 'dio' 0->indirect IO on write, 1->direct IO on write\n" " (only when read side is sg device (using mmap))\n"); } /* Return of 0 -> success, -1 -> failure, 2 -> try again */ int scsi_read_capacity(int sg_fd, long long * num_sect, int * sect_sz) { int k, res; unsigned char rcBuff[RCAP16_REPLY_LEN]; res = sg_ll_readcap_10(sg_fd, 0, 0, rcBuff, READ_CAP_REPLY_LEN, 0); if (0 != res) return res; if ((0xff == rcBuff[0]) && (0xff == rcBuff[1]) && (0xff == rcBuff[2]) && (0xff == rcBuff[3])) { long long ls; res = sg_ll_readcap_16(sg_fd, 0, 0, rcBuff, RCAP16_REPLY_LEN, 0); if (0 != res) return res; for (k = 0, 
ls = 0; k < 8; ++k) { ls <<= 8; ls |= rcBuff[k]; } *num_sect = ls + 1; *sect_sz = (rcBuff[8] << 24) | (rcBuff[9] << 16) | (rcBuff[10] << 8) | rcBuff[11]; } else { *num_sect = 1 + ((rcBuff[0] << 24) | (rcBuff[1] << 16) | (rcBuff[2] << 8) | rcBuff[3]); *sect_sz = (rcBuff[4] << 24) | (rcBuff[5] << 16) | (rcBuff[6] << 8) | rcBuff[7]; } return 0; } /* Return of 0 -> success, -1 -> failure */ int read_blkdev_capacity(int sg_fd, long long * num_sect, int * sect_sz) { #ifdef BLKGETSIZE64 unsigned long long ull; if (ioctl(sg_fd, BLKGETSIZE64, &ull) < 0) { perror("BLKGETSIZE64 ioctl error"); return -1; } if (ioctl(sg_fd, BLKSSZGET, sect_sz) < 0) { perror("BLKSSZGET ioctl error"); return -1; } *num_sect = ((long long)ull / (long long)*sect_sz); #else unsigned long ul; if (ioctl(sg_fd, BLKGETSIZE, &ul) < 0) { perror("BLKGETSIZE ioctl error"); return -1; } *num_sect = (long long)ul; if (ioctl(sg_fd, BLKSSZGET, sect_sz) < 0) { perror("BLKSSZGET ioctl error"); return -1; } #endif return 0; } int sg_build_scsi_cdb(unsigned char * cdbp, int cdb_sz, unsigned int blocks, long long start_block, int write_true, int fua, int dpo) { int rd_opcode[] = {0x8, 0x28, 0xa8, 0x88}; int wr_opcode[] = {0xa, 0x2a, 0xaa, 0x8a}; int sz_ind; memset(cdbp, 0, cdb_sz); if (dpo) cdbp[1] |= 0x10; if (fua) cdbp[1] |= 0x8; switch (cdb_sz) { case 6: sz_ind = 0; cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] : rd_opcode[sz_ind]); cdbp[1] = (unsigned char)((start_block >> 16) & 0x1f); cdbp[2] = (unsigned char)((start_block >> 8) & 0xff); cdbp[3] = (unsigned char)(start_block & 0xff); cdbp[4] = (256 == blocks) ? 
0 : (unsigned char)blocks; if (blocks > 256) { fprintf(stderr, ME "for 6 byte commands, maximum number of " "blocks is 256\n"); return 1; } if ((start_block + blocks - 1) & (~0x1fffff)) { fprintf(stderr, ME "for 6 byte commands, can't address blocks" " beyond %d\n", 0x1fffff); return 1; } if (dpo || fua) { fprintf(stderr, ME "for 6 byte commands, neither dpo nor fua" " bits supported\n"); return 1; } break; case 10: sz_ind = 1; cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] : rd_opcode[sz_ind]); cdbp[2] = (unsigned char)((start_block >> 24) & 0xff); cdbp[3] = (unsigned char)((start_block >> 16) & 0xff); cdbp[4] = (unsigned char)((start_block >> 8) & 0xff); cdbp[5] = (unsigned char)(start_block & 0xff); cdbp[7] = (unsigned char)((blocks >> 8) & 0xff); cdbp[8] = (unsigned char)(blocks & 0xff); if (blocks & (~0xffff)) { fprintf(stderr, ME "for 10 byte commands, maximum number of " "blocks is %d\n", 0xffff); return 1; } break; case 12: sz_ind = 2; cdbp[0] = (unsigned char)(write_true ? wr_opcode[sz_ind] : rd_opcode[sz_ind]); cdbp[2] = (unsigned char)((start_block >> 24) & 0xff); cdbp[3] = (unsigned char)((start_block >> 16) & 0xff); cdbp[4] = (unsigned char)((start_block >> 8) & 0xff); cdbp[5] = (unsigned char)(start_block & 0xff); cdbp[6] = (unsigned char)((blocks >> 24) & 0xff); cdbp[7] = (unsigned char)((blocks >> 16) & 0xff); cdbp[8] = (unsigned char)((blocks >> 8) & 0xff); cdbp[9] = (unsigned char)(blocks & 0xff); break; case 16: sz_ind = 3; cdbp[0] = (unsigned char)(write_true ? 
wr_opcode[sz_ind] : rd_opcode[sz_ind]); cdbp[2] = (unsigned char)((start_block >> 56) & 0xff); cdbp[3] = (unsigned char)((start_block >> 48) & 0xff); cdbp[4] = (unsigned char)((start_block >> 40) & 0xff); cdbp[5] = (unsigned char)((start_block >> 32) & 0xff); cdbp[6] = (unsigned char)((start_block >> 24) & 0xff); cdbp[7] = (unsigned char)((start_block >> 16) & 0xff); cdbp[8] = (unsigned char)((start_block >> 8) & 0xff); cdbp[9] = (unsigned char)(start_block & 0xff); cdbp[10] = (unsigned char)((blocks >> 24) & 0xff); cdbp[11] = (unsigned char)((blocks >> 16) & 0xff); cdbp[12] = (unsigned char)((blocks >> 8) & 0xff); cdbp[13] = (unsigned char)(blocks & 0xff); break; default: fprintf(stderr, ME "expected cdb size of 6, 10, 12, or 16 but got" "=%d\n", cdb_sz); return 1; } return 0; } /* -1 -> unrecoverable error, 0 -> successful, 1 -> recoverable (ENOMEM), 2 -> try again */ int sg_read(int sg_fd, unsigned char * buff, int blocks, long long from_block, int bs, int cdbsz, int fua, int do_mmap) { unsigned char rdCmd[MAX_SCSI_CDBSZ]; unsigned char senseBuff[SENSE_BUFF_LEN]; struct sg_io_hdr io_hdr; int res; if (sg_build_scsi_cdb(rdCmd, cdbsz, blocks, from_block, 0, fua, 0)) { fprintf(stderr, ME "bad rd cdb build, from_block=%lld, blocks=%d\n", from_block, blocks); return -1; } memset(&io_hdr, 0, sizeof(struct sg_io_hdr)); io_hdr.interface_id = 'S'; io_hdr.cmd_len = cdbsz; io_hdr.cmdp = rdCmd; io_hdr.dxfer_direction = SG_DXFER_FROM_DEV; io_hdr.dxfer_len = bs * blocks; if (! 
do_mmap) io_hdr.dxferp = buff; io_hdr.mx_sb_len = SENSE_BUFF_LEN; io_hdr.sbp = senseBuff; io_hdr.timeout = DEF_TIMEOUT; io_hdr.pack_id = (int)from_block; if (do_mmap) io_hdr.flags |= SG_FLAG_MMAP_IO; while (((res = write(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) && (EINTR == errno)) ; if (res < 0) { if (ENOMEM == errno) return 1; perror("reading (wr) on sg device, error"); return -1; } while (((res = read(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) && (EINTR == errno)) ; if (res < 0) { perror("reading (rd) on sg device, error"); return -1; } switch (sg_err_category3(&io_hdr)) { case SG_LIB_CAT_CLEAN: break; case SG_LIB_CAT_RECOVERED: fprintf(stderr, "Recovered error while reading block=%lld, num=%d\n", from_block, blocks); break; case SG_LIB_CAT_MEDIA_CHANGED: return 2; default: sg_chk_n_print3("reading", &io_hdr); return -1; } sum_of_resids += io_hdr.resid; #ifdef SG_DEBUG fprintf(stderr, "duration=%u ms\n", io_hdr.duration); #endif return 0; } /* -1 -> unrecoverable error, 0 -> successful, 1 -> recoverable (ENOMEM), 2 -> try again */ int sg_write(int sg_fd, unsigned char * buff, int blocks, long long to_block, int bs, int cdbsz, int fua, int do_mmap, int * diop) { unsigned char wrCmd[MAX_SCSI_CDBSZ]; unsigned char senseBuff[SENSE_BUFF_LEN]; struct sg_io_hdr io_hdr; int res; if (sg_build_scsi_cdb(wrCmd, cdbsz, blocks, to_block, 1, fua, 0)) { fprintf(stderr, ME "bad wr cdb build, to_block=%lld, blocks=%d\n", to_block, blocks); return -1; } memset(&io_hdr, 0, sizeof(struct sg_io_hdr)); io_hdr.interface_id = 'S'; io_hdr.cmd_len = cdbsz; io_hdr.cmdp = wrCmd; io_hdr.dxfer_direction = SG_DXFER_TO_DEV; io_hdr.dxfer_len = bs * blocks; if (! 
do_mmap) io_hdr.dxferp = buff; io_hdr.mx_sb_len = SENSE_BUFF_LEN; io_hdr.sbp = senseBuff; io_hdr.timeout = DEF_TIMEOUT; io_hdr.pack_id = (int)to_block; if (do_mmap) io_hdr.flags |= SG_FLAG_MMAP_IO; if (diop && *diop) io_hdr.flags |= SG_FLAG_DIRECT_IO; while (((res = write(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) && (EINTR == errno)) ; if (res < 0) { if (ENOMEM == errno) return 1; perror("writing (wr) on sg device, error"); return -1; } while (((res = read(sg_fd, &io_hdr, sizeof(io_hdr))) < 0) && (EINTR == errno)) ; if (res < 0) { perror("writing (rd) on sg device, error"); return -1; } switch (sg_err_category3(&io_hdr)) { case SG_LIB_CAT_CLEAN: break; case SG_LIB_CAT_RECOVERED: fprintf(stderr, "Recovered error while writing block=%lld, num=%d\n", to_block, blocks); break; case SG_LIB_CAT_MEDIA_CHANGED: return 2; default: sg_chk_n_print3("writing", &io_hdr); return -1; } if (diop && *diop && ((io_hdr.info & SG_INFO_DIRECT_IO_MASK) != SG_INFO_DIRECT_IO)) *diop = 0; /* flag that dio not done (completely) */ return 0; } #define STR_SZ 1024 #define INOUTF_SZ 512 #define EBUFF_SZ 512 int main(int argc, char * argv[]) { long long skip = 0; long long seek = 0; int bs = 0; int ibs = 0; int obs = 0; int bpt = DEF_BLOCKS_PER_TRANSFER; char str[STR_SZ]; char * key; char * buf; char inf[INOUTF_SZ]; int in_type = FT_OTHER; char outf[INOUTF_SZ]; int out_type = FT_OTHER; int res, k, t; int infd, outfd, blocks; unsigned char * wrkPos; unsigned char * wrkBuff = NULL; unsigned char * wrkMmap = NULL; long long in_num_sect = -1; int in_res_sz = 0; long long out_num_sect = -1; int out_res_sz = 0; int do_time = 0; int scsi_cdbsz_in = DEF_SCSI_CDBSZ; int scsi_cdbsz_out = DEF_SCSI_CDBSZ; int do_sync = 0; int do_dio = 0; int num_dio_not_done = 0; int fua_mode = 0; int in_sect_sz, out_sect_sz; char ebuff[EBUFF_SZ]; int blocks_per; long long req_count; size_t psz = getpagesize(); struct timeval start_tm, end_tm; inf[0] = '\0'; outf[0] = '\0'; if (argc < 2) { usage(); return 1; } for(k = 1; k < 
argc; k++) { if (argv[k]) strncpy(str, argv[k], STR_SZ); else continue; for(key = str, buf = key; *buf && *buf != '=';) buf++; if (*buf) *buf++ = '\0'; if (strcmp(key,"if") == 0) { if ('\0' != inf[0]) { fprintf(stderr, "Second 'if=' argument??\n"); return 1; } else strncpy(inf, buf, INOUTF_SZ); } else if (strcmp(key,"of") == 0) { if ('\0' != outf[0]) { fprintf(stderr, "Second 'of=' argument??\n"); return 1; } else strncpy(outf, buf, INOUTF_SZ); } else if (0 == strcmp(key,"ibs")) ibs = sg_get_num(buf); else if (0 == strcmp(key,"obs")) obs = sg_get_num(buf); else if (0 == strcmp(key,"bs")) bs = sg_get_num(buf); else if (0 == strcmp(key,"bpt")) bpt = sg_get_num(buf); else if (0 == strcmp(key,"skip")) skip = sg_get_llnum(buf); else if (0 == strcmp(key,"seek")) seek = sg_get_llnum(buf); else if (0 == strcmp(key,"count")) dd_count = sg_get_llnum(buf); else if (0 == strcmp(key,"time")) do_time = sg_get_num(buf); else if (0 == strcmp(key,"cdbsz")) { scsi_cdbsz_in = sg_get_num(buf); scsi_cdbsz_out = scsi_cdbsz_in; } else if (0 == strcmp(key,"fua")) fua_mode = sg_get_num(buf); else if (0 == strcmp(key,"sync")) do_sync = sg_get_num(buf); else if (0 == strcmp(key,"dio")) do_dio = sg_get_num(buf); else if (0 == strncmp(key, "--vers", 6)) { fprintf(stderr, ME "for Linux sg version 3 driver: %s\n", version_str); return 0; } else { fprintf(stderr, "Unrecognized argument '%s'\n", key); usage(); return 1; } } if (bs <= 0) { bs = DEF_BLOCK_SIZE; fprintf(stderr, "Assume default 'bs' (block size) of %d bytes\n", bs); } if ((ibs && (ibs != bs)) || (obs && (obs != bs))) { fprintf(stderr, "If 'ibs' or 'obs' given must be same as 'bs'\n"); usage(); return 1; } if ((skip < 0) || (seek < 0)) { fprintf(stderr, "skip and seek cannot be negative\n"); return 1; } if (bpt < 1) { fprintf(stderr, "bpt must be greater than 0\n"); return 1; } #ifdef SG_DEBUG fprintf(stderr, ME "if=%s skip=%lld of=%s seek=%lld count=%lld\n", inf, skip, outf, seek, dd_count); #endif install_handler (SIGINT, 
interrupt_handler); install_handler (SIGQUIT, interrupt_handler); install_handler (SIGPIPE, interrupt_handler); install_handler (SIGUSR1, siginfo_handler); infd = STDIN_FILENO; outfd = STDOUT_FILENO; if (inf[0] && ('-' != inf[0])) { in_type = dd_filetype(inf); if (FT_ST == in_type) { fprintf(stderr, ME "unable to use scsi tape device %s\n", inf); return 1; } else if (FT_SG == in_type) { if ((infd = open(inf, O_RDWR)) < 0) { snprintf(ebuff, EBUFF_SZ, ME "could not open %s for sg reading", inf); perror(ebuff); return 1; } res = ioctl(infd, SG_GET_VERSION_NUM, &t); if ((res < 0) || (t < 30122)) { fprintf(stderr, ME "sg driver prior to 3.1.22\n"); return 1; } in_res_sz = bs * bpt; if (0 != (in_res_sz % psz)) /* round up to next page */ in_res_sz = ((in_res_sz / psz) + 1) * psz; if (ioctl(infd, SG_GET_RESERVED_SIZE, &t) < 0) { perror(ME "SG_GET_RESERVED_SIZE error"); return 1; } if (in_res_sz > t) { if (ioctl(infd, SG_SET_RESERVED_SIZE, &in_res_sz) < 0) { perror(ME "SG_SET_RESERVED_SIZE error"); return 1; } } wrkMmap = mmap(NULL, in_res_sz, PROT_READ | PROT_WRITE, MAP_SHARED, infd, 0); if (MAP_FAILED == wrkMmap) { snprintf(ebuff, EBUFF_SZ, ME "error using mmap() on file: %s", inf); perror(ebuff); return 1; } } else { if ((infd = open(inf, O_RDONLY)) < 0) { snprintf(ebuff, EBUFF_SZ, ME "could not open %s for reading", inf); perror(ebuff); return 1; } else if (skip > 0) { llse_loff_t offset = skip; offset *= bs; /* could exceed 32 bits here! 
*/ if (llse_llseek(infd, offset, SEEK_SET) < 0) { snprintf(ebuff, EBUFF_SZ, ME "couldn't skip to " "required position on %s", inf); perror(ebuff); return 1; } } } } if (outf[0] && ('-' != outf[0])) { out_type = dd_filetype(outf); if (FT_ST == out_type) { fprintf(stderr, ME "unable to use scsi tape device %s\n", outf); return 1; } else if (FT_SG == out_type) { if ((outfd = open(outf, O_RDWR)) < 0) { snprintf(ebuff, EBUFF_SZ, ME "could not open %s for " "sg writing", outf); perror(ebuff); return 1; } res = ioctl(outfd, SG_GET_VERSION_NUM, &t); if ((res < 0) || (t < 30122)) { fprintf(stderr, ME "sg driver prior to 3.1.22\n"); return 1; } if (ioctl(outfd, SG_GET_RESERVED_SIZE, &t) < 0) { perror(ME "SG_GET_RESERVED_SIZE error"); return 1; } out_res_sz = bs * bpt; if (out_res_sz > t) { if (ioctl(outfd, SG_SET_RESERVED_SIZE, &out_res_sz) < 0) { perror(ME "SG_SET_RESERVED_SIZE error"); return 1; } } if (NULL == wrkMmap) { wrkMmap = mmap(NULL, out_res_sz, PROT_READ | PROT_WRITE, MAP_SHARED, outfd, 0); if (MAP_FAILED == wrkMmap) { snprintf(ebuff, EBUFF_SZ, ME "error using mmap() on file: %s", outf); perror(ebuff); return 1; } } } else if (FT_DEV_NULL == out_type) outfd = -1; /* don't bother opening */ else { if (FT_RAW != out_type) { if ((outfd = open(outf, O_WRONLY | O_CREAT, 0666)) < 0) { snprintf(ebuff, EBUFF_SZ, ME "could not open %s for writing", outf); perror(ebuff); return 1; } } else { if ((outfd = open(outf, O_WRONLY)) < 0) { snprintf(ebuff, EBUFF_SZ, ME "could not open %s " "for raw writing", outf); perror(ebuff); return 1; } } if (seek > 0) { llse_loff_t offset = seek; offset *= bs; /* could exceed 32 bits here! 
*/ if (llse_llseek(outfd, offset, SEEK_SET) < 0) { snprintf(ebuff, EBUFF_SZ, ME "couldn't seek to " "required position on %s", outf); perror(ebuff); return 1; } } } } if ((STDIN_FILENO == infd) && (STDOUT_FILENO == outfd)) { fprintf(stderr, "Can't have both 'if' as stdin _and_ 'of' as stdout\n"); return 1; } if (dd_count < 0) { in_num_sect = -1; if (FT_SG == in_type) { res = scsi_read_capacity(infd, &in_num_sect, &in_sect_sz); if (2 == res) { fprintf(stderr, "Unit attention, media changed(in), continuing\n"); res = scsi_read_capacity(infd, &in_num_sect, &in_sect_sz); } if (0 != res) { fprintf(stderr, "Unable to read capacity on %s\n", inf); in_num_sect = -1; } } else if (FT_BLOCK == in_type) { if (0 != read_blkdev_capacity(infd, &in_num_sect, &in_sect_sz)) { fprintf(stderr, "Unable to read block capacity on %s\n", inf); in_num_sect = -1; } if (bs != in_sect_sz) { fprintf(stderr, "block size on %s confusion; bs=%d, from " "device=%d\n", inf, bs, in_sect_sz); in_num_sect = -1; } } if (in_num_sect > skip) in_num_sect -= skip; out_num_sect = -1; if (FT_SG == out_type) { res = scsi_read_capacity(outfd, &out_num_sect, &out_sect_sz); if (2 == res) { fprintf(stderr, "Unit attention, media changed(out), continuing\n"); res = scsi_read_capacity(outfd, &out_num_sect, &out_sect_sz); } if (0 != res) { fprintf(stderr, "Unable to read capacity on %s\n", outf); out_num_sect = -1; } } else if (FT_BLOCK == out_type) { if (0 != read_blkdev_capacity(outfd, &out_num_sect, &out_sect_sz)) { fprintf(stderr, "Unable to read block capacity on %s\n", outf); out_num_sect = -1; } if (bs != out_sect_sz) { fprintf(stderr, "block size on %s confusion: bs=%d, from " "device=%d\n", outf, bs, out_sect_sz); out_num_sect = -1; } } if (out_num_sect > seek) out_num_sect -= seek; #ifdef SG_DEBUG fprintf(stderr, "Start of loop, count=%lld, in_num_sect=%lld, out_num_sect=%lld\n", dd_count, in_num_sect, out_num_sect); #endif if (in_num_sect > 0) { if (out_num_sect > 0) dd_count = (in_num_sect > 
out_num_sect) ? out_num_sect : in_num_sect; else dd_count = in_num_sect; } else dd_count = out_num_sect; } if (dd_count < 0) { fprintf(stderr, "Couldn't calculate count, please give one\n"); return 1; } if ((FT_SG == in_type) && ((dd_count + skip) > UINT_MAX) && (MAX_SCSI_CDBSZ != scsi_cdbsz_in)) { fprintf(stderr, "Note: SCSI command size increased to 16 bytes " "(for 'if')\n"); scsi_cdbsz_in = MAX_SCSI_CDBSZ; } if ((FT_SG == out_type) && ((dd_count + seek) > UINT_MAX) && (MAX_SCSI_CDBSZ != scsi_cdbsz_out)) { fprintf(stderr, "Note: SCSI command size increased to 16 bytes " "(for 'of')\n"); scsi_cdbsz_out = MAX_SCSI_CDBSZ; } if (do_dio && (FT_SG != in_type)) { do_dio = 0; fprintf(stderr, ">>> dio only performed on 'of' side when 'if' is" " an sg device\n"); } if (do_dio) { int fd; char c; if ((fd = open(proc_allow_dio, O_RDONLY)) >= 0) { if (1 == read(fd, &c, 1)) { if ('0' == c) fprintf(stderr, ">>> %s set to '0' but should be set " "to '1' for direct IO\n", proc_allow_dio); } close(fd); } } if (wrkMmap) wrkPos = wrkMmap; else { if ((FT_RAW == in_type) || (FT_RAW == out_type)) { wrkBuff = malloc(bs * bpt + psz); if (0 == wrkBuff) { fprintf(stderr, "Not enough user memory for raw\n"); return 1; } wrkPos = (unsigned char *)(((unsigned long)wrkBuff + psz - 1) & (~(psz - 1))); } else { wrkBuff = malloc(bs * bpt); if (0 == wrkBuff) { fprintf(stderr, "Not enough user memory\n"); return 1; } wrkPos = wrkBuff; } } blocks_per = bpt; #ifdef SG_DEBUG fprintf(stderr, "Start of loop, count=%lld, blocks_per=%d\n", dd_count, blocks_per); #endif if (do_time) { start_tm.tv_sec = 0; start_tm.tv_usec = 0; gettimeofday(&start_tm, NULL); } req_count = dd_count; while (dd_count > 0) { blocks = (dd_count > blocks_per) ? 
blocks_per : dd_count; if (FT_SG == in_type) { int fua = fua_mode & 2; res = sg_read(infd, wrkPos, blocks, skip, bs, scsi_cdbsz_in, fua, 1); if (2 == res) { fprintf(stderr, "Unit attention, media changed, continuing (r)\n"); res = sg_read(infd, wrkPos, blocks, skip, bs, scsi_cdbsz_in, fua, 1); } if (0 != res) { fprintf(stderr, "sg_read failed, skip=%lld\n", skip); break; } else in_full += blocks; } else { while (((res = read(infd, wrkPos, blocks * bs)) < 0) && (EINTR == errno)) ; if (res < 0) { snprintf(ebuff, EBUFF_SZ, ME "reading, skip=%lld ", skip); perror(ebuff); break; } else if (res < blocks * bs) { dd_count = 0; blocks = res / bs; if ((res % bs) > 0) { blocks++; in_partial++; } } in_full += blocks; } if (0 == blocks) break; /* read nothing so leave loop */ if (FT_SG == out_type) { int do_mmap = (FT_SG == in_type) ? 0 : 1; int fua = fua_mode & 1; int dio_res = do_dio; res = sg_write(outfd, wrkPos, blocks, seek, bs, scsi_cdbsz_out, fua, do_mmap, &dio_res); if (2 == res) { fprintf(stderr, "Unit attention, media changed, continuing (w)\n"); res = sg_write(outfd, wrkPos, blocks, seek, bs, scsi_cdbsz_out, fua, do_mmap, &dio_res); } else if (0 != res) { fprintf(stderr, "sg_write failed, seek=%lld\n", seek); break; } else { out_full += blocks; if (do_dio && (0 == dio_res)) num_dio_not_done++; } } else if (FT_DEV_NULL == out_type) out_full += blocks; /* act as if written out without error */ else { while (((res = write(outfd, wrkPos, blocks * bs)) < 0) && (EINTR == errno)) ; if (res < 0) { snprintf(ebuff, EBUFF_SZ, ME "writing, seek=%lld ", seek); perror(ebuff); break; } else if (res < blocks * bs) { fprintf(stderr, "output file probably full, seek=%lld ", seek); blocks = res / bs; out_full += blocks; if ((res % bs) > 0) out_partial++; break; } else out_full += blocks; } if (dd_count > 0) dd_count -= blocks; skip += blocks; seek += blocks; } if ((do_time) && (start_tm.tv_sec || start_tm.tv_usec)) { struct timeval res_tm; double a, b; gettimeofday(&end_tm, NULL); 
res_tm.tv_sec = end_tm.tv_sec - start_tm.tv_sec; res_tm.tv_usec = end_tm.tv_usec - start_tm.tv_usec; if (res_tm.tv_usec < 0) { --res_tm.tv_sec; res_tm.tv_usec += 1000000; } a = res_tm.tv_sec; a += (0.000001 * res_tm.tv_usec); b = (double)bs * (req_count - dd_count); fprintf(stderr, "time to transfer data was %d.%06d secs", (int)res_tm.tv_sec, (int)res_tm.tv_usec); if ((a > 0.00001) && (b > 511)) fprintf(stderr, ", %.2f MB/sec\n", b / (a * 1000000.0)); else fprintf(stderr, "\n"); } if (do_sync) { if (FT_SG == out_type) { fprintf(stderr, ">> Synchronizing cache on %s\n", outf); res = sg_ll_sync_cache_10(outfd, 0, 0, 0, 0, 0, 0, 0); if (2 == res) { fprintf(stderr, "Unit attention, media changed(in), continuing\n"); res = sg_ll_sync_cache_10(outfd, 0, 0, 0, 0, 0, 0, 0); } if (0 != res) fprintf(stderr, "Unable to synchronize cache\n"); } } if (wrkBuff) free(wrkBuff); if (STDIN_FILENO != infd) close(infd); if ((STDOUT_FILENO != outfd) && (FT_DEV_NULL != out_type)) close(outfd); res = 0; if (0 != dd_count) { fprintf(stderr, "Some error occurred,"); res = 2; } print_stats(); if (sum_of_resids) fprintf(stderr, ">> Non-zero sum of residual counts=%d\n", sum_of_resids); if (num_dio_not_done) fprintf(stderr, ">> dio requested but _not done %d times\n", num_dio_not_done); return res; } ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Process Scheduling Issue using sg/libata 2007-11-18 22:29 ` Fajun Chen @ 2007-11-18 23:07 ` Mark Lord 2007-11-19 16:40 ` James Chapman 0 siblings, 1 reply; 19+ messages in thread From: Mark Lord @ 2007-11-18 23:07 UTC (permalink / raw) To: Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo Fajun Chen wrote: > > As a matter of fact, I'm using /dev/sg*. Due to the size of my test > application, I have not be able to compress it into a small and > publishable form. However, this issue can be easily reproduced on my > ARM XScale target using sg3_util code as follows: > 1. Run printtime.c attached, which prints message to console in a loop. > 2. Run sgm_dd (part of sg3_util package, source code attached) on the > same system as follows: >> sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1 > The print task can be delayed for as many as 25 seconds. Surprisingly, > I can't reproduce the problem in an i386 test system with a more > powerful processor. > > Some clarification to MAP_ANONYMOUS option in mmap(). After fixing a > bug and more testing, this option seems to make no difference to cpu > load. Sorry about previous report. Back to the drawing board now :-) .. Okay, I don't see anything unusual here. The code is on a slow CPU, and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA device. This *will* tie up the CPU at 100% for the duration of the I/O, because the I/O happens in interrupt handlers, which are outside of the realm of the CPU scheduler. This is a known shortcoming of Linux for real-time uses. When the I/O uses DMA transfers, it *may* still have a similar effect, depending upon the caching in the ATA device, and on how the DMA shares the memory bus with the CPU. Again, no surprise here. One way to deal with it in an embedded device, is to force the application that's generating the I/O to self-throttle. Or modify the device driver to self-throttle. 
You may want to find an embedded Linux consultant to help out with this situation if it's beyond your expertise. Cheers
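A rough illustration of the application-side self-throttling suggested above. This is not code from the thread: the names (struct throttle, throttle_init, throttle_tick) are made up for this sketch, and the budget and pause values would have to be tuned on the actual target. It only spaces requests out so that other tasks get a chance to run between bursts; it does not shorten the time spent inside any single interrupt-driven transfer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical helper state: counts requests issued since the last pause. */
struct throttle {
    unsigned issued;    /* requests completed since the last pause */
    unsigned budget;    /* requests allowed between two pauses */
    long pause_ns;      /* length of each pause, must be < 1 second */
    unsigned pauses;    /* total pauses taken so far (for reporting) */
};

static void throttle_init(struct throttle *t, unsigned budget, long pause_ns)
{
    assert(budget > 0 && pause_ns >= 0 && pause_ns < 1000000000L);
    t->issued = 0;
    t->budget = budget;
    t->pause_ns = pause_ns;
    t->pauses = 0;
}

/* Call once after each completed request.
 * Returns 1 if the budget was used up and the caller was put to sleep,
 * 0 if the request was simply counted. */
static int throttle_tick(struct throttle *t)
{
    struct timespec ts;

    if (++t->issued < t->budget)
        return 0;

    t->issued = 0;
    t->pauses++;
    ts.tv_sec = 0;
    ts.tv_nsec = t->pause_ns;
    /* Resume the sleep with the remaining time if a signal interrupts it. */
    while ((nanosleep(&ts, &ts) < 0) && (EINTR == errno))
        ;
    return 1;
}
```

In a copy loop like the one in sgm_dd, throttle_tick() would be called once per completed sg_read(). For example, throttle_init(&t, 64, 2000000) would pause for 2 ms after every 64 requests; those numbers are guesses, not measured values.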
* Re: Process Scheduling Issue using sg/libata 2007-11-18 23:07 ` Mark Lord @ 2007-11-19 16:40 ` James Chapman 2007-11-19 16:51 ` Tejun Heo 0 siblings, 1 reply; 19+ messages in thread From: James Chapman @ 2007-11-19 16:40 UTC (permalink / raw) To: Mark Lord, Fajun Chen; +Cc: linux-ide@vger.kernel.org, linux-scsi, Tejun Heo Mark Lord wrote: > Fajun Chen wrote: >> >> As a matter of fact, I'm using /dev/sg*. Due to the size of my test >> application, I have not be able to compress it into a small and >> publishable form. However, this issue can be easily reproduced on my >> ARM XScale target using sg3_util code as follows: >> 1. Run printtime.c attached, which prints message to console in a loop. >> 2. Run sgm_dd (part of sg3_util package, source code attached) on the >> same system as follows: >>> sgm_dd if=/dev/sg0 of=/dev/null count=10M bpt=1 >> The print task can be delayed for as many as 25 seconds. Surprisingly, >> I can't reproduce the problem in an i386 test system with a more >> powerful processor. >> >> Some clarification to MAP_ANONYMOUS option in mmap(). After fixing a >> bug and more testing, this option seems to make no difference to cpu >> load. Sorry about previous report. Back to the drawing board now :-) > .. > > Okay, I don't see anything unusual here. The code is on a slow CPU, > and is triggering 10MBytes of PIO over a (probably) slow bus to an ATA > device. > > This *will* tie up the CPU at 100% for the duration of the I/O, > because the I/O happens in interrupt handlers, which are outside > of the realm of the CPU scheduler. > > This is a known shortcoming of Linux for real-time uses. > > When the I/O uses DMA transfers, it *may* still have a similar effect, > depending upon the caching in the ATA device, and on how the DMA shares > the memory bus with the CPU. > > Again, no surprise here. > > One way to deal with it in an embedded device, is to force the > application that's generating the I/O to self-throttle. 
> Or modify the device driver to self-throttle. Does disk access have to be so interrupt driven? Could disk interrupt handling be done in a softirq/kthread like the networking guys deal with network device interrupts? This would prevent the system from live-locking when it is being bombarded with disk IO events. It doesn't seem right that the disk IO subsystem can cause interrupt live-lock on relatively slow CPUs... > You may want to find an embedded Linux consultant to help out > with this situation if it's beyond your expertise. Check out the rtlinux patch, which pushes all interrupt handling out to per-cpu kernel threads (irqd). The kernel scheduler then regains control of what runs when. Another option is to change your ATA driver to do interrupt processing at task level using a workqueue or similar. -- James Chapman Katalix Systems Ltd http://www.katalix.com Catalysts for your Embedded Linux software development ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Process Scheduling Issue using sg/libata 2007-11-19 16:40 ` James Chapman @ 2007-11-19 16:51 ` Tejun Heo 2007-11-19 17:17 ` Alan Cox 0 siblings, 1 reply; 19+ messages in thread From: Tejun Heo @ 2007-11-19 16:51 UTC (permalink / raw) To: James Chapman Cc: Mark Lord, Fajun Chen, linux-ide@vger.kernel.org, linux-scsi James Chapman wrote: > Mark Lord wrote: >> One way to deal with it in an embedded device, is to force the >> application that's generating the I/O to self-throttle. >> Or modify the device driver to self-throttle. > > Does disk access have to be so interrupt driven? Could disk interrupt > handling be done in a softirq/kthread like the networking guys deal with > network device interrupts? This would prevent the system from > live-locking when it is being bombarded with disk IO events. It doesn't > seem right that the disk IO subsystem can cause interrupt live-lock on > relatively slow CPUs... > >> You may want to find an embedded Linux consultant to help out >> with this situation if it's beyond your expertise. > > Check out the rtlinux patch, which pushes all interrupt handling out to > per-cpu kernel threads (irqd). The kernel scheduler then regains control > of what runs when. > > Another option is to change your ATA driver to do interrupt processing > at task level using a workqueue or similar. SFF ATA controllers are peculiar in that... 1. it doesn't have reliable IRQ pending bit. 2. it doesn't have reliable IRQ mask bit. 3. some controllers tank the machine completely if status or data register is accessed differently than the chip likes. So, it's not like we're all dickheads. We know it's good to take those out of irq handler. The hardware just isn't very forgiving and I bet you'll get obscure machine lockups if the RT kernel arbitrarily pushes ATA PIO data transfers into kernel threads. I think doing what IDE has been doing (disabling IRQ from interrupt controller) is the way to go. 
-- tejun
* Re: Process Scheduling Issue using sg/libata 2007-11-19 16:51 ` Tejun Heo @ 2007-11-19 17:17 ` Alan Cox 0 siblings, 0 replies; 19+ messages in thread From: Alan Cox @ 2007-11-19 17:17 UTC (permalink / raw) To: Tejun Heo Cc: James Chapman, Mark Lord, Fajun Chen, linux-ide@vger.kernel.org, linux-scsi > SFF ATA controllers are peculiar in that... > > 1. it doesn't have reliable IRQ pending bit. > > 2. it doesn't have reliable IRQ mask bit. > > 3. some controllers tank the machine completely if status or data > register is accessed differently than the chip likes. And 4., which is a killer for a lot of RT users: an I/O cycle to a taskfile-style controller generally goes at ISA-type speed down the wire to the drive and back again. The CPU is stalled for this and there is nothing we can do about it. > > So, it's not like we're all dickheads. We know it's good to take those > out of irq handler. The hardware just isn't very forgiving and I bet > you'll get obscure machine lockups if the RT kernel arbitrarily pushes > ATA PIO data transfers into kernel threads. > > I think doing what IDE has been doing (disabling IRQ from interrupt > controller) is the way to go. Agreed - at which point, RT or otherwise, you can push it out. If you need to do serious (sub-1 ms) ATA then also go get a non-SFF controller. Alan
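To judge whether any of the remedies discussed in this thread actually helps, it is useful to measure how late a sleeping task wakes up while the I/O load is running. The following is a minimal sketch in the spirit of the printtime.c test mentioned earlier in the thread. It is not that program, whose source is not shown in this part of the thread, and the function names here (ts_diff_ns, worst_wakeup_delay_ns) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Returns (b - a) in nanoseconds. */
static long long ts_diff_ns(const struct timespec *a, const struct timespec *b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
           + (long long)(b->tv_nsec - a->tv_nsec);
}

/* Sleeps for period_ns, `iterations` times over, and returns the largest
 * amount (in ns) by which a wake-up overshot the requested period.
 * Returns -1 if the monotonic clock cannot be read. */
static long long worst_wakeup_delay_ns(long period_ns, int iterations)
{
    struct timespec req, before, after;
    long long late, worst = 0;
    int i;

    assert(period_ns > 0 && period_ns < 1000000000L && iterations > 0);

    for (i = 0; i < iterations; i++) {
        if (clock_gettime(CLOCK_MONOTONIC, &before) < 0)
            return -1;
        req.tv_sec = 0;
        req.tv_nsec = period_ns;
        /* Resume the sleep with the remaining time if a signal interrupts it. */
        while ((nanosleep(&req, &req) < 0) && (EINTR == errno))
            ;
        if (clock_gettime(CLOCK_MONOTONIC, &after) < 0)
            return -1;
        late = ts_diff_ns(&before, &after) - period_ns;
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Running worst_wakeup_delay_ns() in one process while sgm_dd runs in another, once with bpt=1 and once with a larger bpt, should show whether the stalls described in the original report shrink. The period and iteration count are left to the caller because suitable values depend on the board.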