* Maximum NVMe IO command size > 1MB?
@ 2016-01-06 19:23 Xuehua Chen
2016-01-06 19:31 ` Keith Busch
0 siblings, 1 reply; 9+ messages in thread
From: Xuehua Chen @ 2016-01-06 19:23 UTC (permalink / raw)
Hi,
It seems to me that kernel 4.3 supports NVMe IO command sizes > 512k after the following was added:
blk_queue_max_segments(ns->queue,
        ((dev->max_hw_sectors << 9) / dev->page_size) + 1);
If I run the following,
fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=1M --bs=1M --rw=read
I can see that one read with a data transfer size of 1MB is sent to the device.
But if I increase bs to 2M as below, I still see two 1MB commands sent out instead of one 2MB read command:
fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read
Are there any other settings in the kernel that make it split a 2M command into two 1M commands?
Thanks,
Xuehua
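For reference, the segment count produced by the blk_queue_max_segments() call quoted above can be worked out by hand. A minimal sketch in Python, assuming max_hw_sectors is counted in 512-byte sectors (as the "<< 9" suggests) and a 4 KiB device page size; the input values are examples, not read from a device:

```python
SECTOR_SHIFT = 9  # 512-byte sectors, matching the "<< 9" in the snippet


def max_segments(max_hw_sectors: int, page_size: int) -> int:
    """Mirror ((max_hw_sectors << 9) / page_size) + 1 from the quoted call."""
    return ((max_hw_sectors << SECTOR_SHIFT) // page_size) + 1


# Assumed example: a 2 MiB hardware limit (4096 sectors) and 4 KiB pages.
print(max_segments(4096, 4096))  # 513
# Assumed example: a 1 MiB hardware limit (2048 sectors) and 4 KiB pages.
print(max_segments(2048, 4096))  # 257
```

The extra "+ 1" segment presumably allows for a buffer that does not start on a page boundary, although the thread does not say so.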
^ permalink raw reply [flat|nested] 9+ messages in thread
* Maximum NVMe IO command size > 1MB?
2016-01-06 19:23 Maximum NVMe IO command size > 1MB? Xuehua Chen
@ 2016-01-06 19:31 ` Keith Busch
2016-01-06 19:51 ` Xuehua Chen
0 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2016-01-06 19:31 UTC (permalink / raw)
On Wed, Jan 06, 2016 at 07:23:53PM +0000, Xuehua Chen wrote:
> It seems to me kernel 4.3 supports NVMe IO command size > 512k after the following is added.
>
> blk_queue_max_segments(ns->queue,
>         ((dev->max_hw_sectors << 9) / dev->page_size) + 1);
>
> If I run the following,
> fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=1M --bs=1M --rw=read
>
> I can see one read with data transfer size 1MB is sent to device.
>
> But if I increase the bs to 2M as below, I still see two 1MB commands are sent out instead of one 2MB read command
> fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read
>
> Is there any other settings in kernel that make it split a 2M command into two 1M commands?
Is the device actually capable of 2MB transfers? You can confirm with:
# cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
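The same check can be scripted. A small sketch, assuming the usual /sys/block/<dev>/queue/ layout; read_queue_limit_kb and its sysfs_root parameter are hypothetical helper names used only for illustration:

```python
from pathlib import Path


def read_queue_limit_kb(dev, attr, sysfs_root="/sys/block"):
    """Return a queue limit in KB, or None if the attribute cannot be read."""
    path = Path(sysfs_root) / dev / "queue" / attr
    try:
        return int(path.read_text().strip())
    except OSError:
        # Device or attribute not present on this machine.
        return None


# Both calls print None on a machine without this device.
print(read_queue_limit_kb("nvme0n1", "max_hw_sectors_kb"))
print(read_queue_limit_kb("nvme0n1", "max_sectors_kb"))
```

Reading max_sectors_kb alongside max_hw_sectors_kb is worthwhile because the former is the writable soft limit and may be lower than what the hardware reports.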
* Maximum NVMe IO command size > 1MB?
2016-01-06 19:31 ` Keith Busch
@ 2016-01-06 19:51 ` Xuehua Chen
2016-01-06 21:56 ` Xuehua Chen
2016-01-07 11:39 ` Sagi Grimberg
0 siblings, 2 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-06 19:51 UTC (permalink / raw)
The value is 2048, which seems to be 2MB.
* Maximum NVMe IO command size > 1MB?
2016-01-06 19:51 ` Xuehua Chen
@ 2016-01-06 21:56 ` Xuehua Chen
2016-01-06 22:54 ` Keith Busch
2016-01-07 11:39 ` Sagi Grimberg
1 sibling, 1 reply; 9+ messages in thread
From: Xuehua Chen @ 2016-01-06 21:56 UTC (permalink / raw)
Hi, Keith,
I wonder whether this could be caused by BIO_MAX_PAGES defined as 256, which means 1MB at most.
What do you think?
Xuehua
* Maximum NVMe IO command size > 1MB?
2016-01-06 21:56 ` Xuehua Chen
@ 2016-01-06 22:54 ` Keith Busch
2016-01-07 17:38 ` Xuehua Chen
2016-01-10 22:16 ` Xuehua Chen
0 siblings, 2 replies; 9+ messages in thread
From: Keith Busch @ 2016-01-06 22:54 UTC (permalink / raw)
On Wed, Jan 06, 2016 at 09:56:24PM +0000, Xuehua Chen wrote:
> Hi, Keith,
>
> I wonder whether this could be caused by BIO_MAX_PAGES defined as 256, which means 1MB at most.
> What do you think?
I think you got it. You're running O_DIRECT, and in fs/direct-io.c, dio_new_bio() allocates up to BIO_MAX_PAGES.
I can't tell where the value came from (looks like it was there from the very first git commit), but maybe you can propose raising it if you set BIO_MAX_PAGES higher without issue.
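The 1MB ceiling follows directly from the arithmetic. A quick sketch, assuming 4 KiB pages (PAGE_SIZE is an assumption here; the thread only implies it through "256 pages means 1MB"):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def bio_limit_kb(bio_max_pages, page_size=PAGE_SIZE):
    """Largest payload, in KB, that a bio holding bio_max_pages pages can carry."""
    return bio_max_pages * page_size // 1024


for pages in (256, 512):
    print(pages, bio_limit_kb(pages))
# 256 1024
# 512 2048
```

Under that assumption, 256 pages caps a single bio at 1024 KB, and doubling the constant to 512 would be the smallest value that allows a 2048 KB bio.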
* Maximum NVMe IO command size > 1MB?
2016-01-06 22:54 ` Keith Busch
@ 2016-01-07 17:38 ` Xuehua Chen
2016-01-10 22:16 ` Xuehua Chen
1 sibling, 0 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-07 17:38 UTC (permalink / raw)
Thanks, will try raising it and see how it goes.
* Maximum NVMe IO command size > 1MB?
2016-01-06 22:54 ` Keith Busch
2016-01-07 17:38 ` Xuehua Chen
@ 2016-01-10 22:16 ` Xuehua Chen
1 sibling, 0 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-10 22:16 UTC (permalink / raw)
Yes, dio_new_bio() caused the splitting.
I tried raising BIO_MAX_PAGES to 512 and ran the command below again.
fio --name=iotest --filename=/dev/nvme0n1 --iodepth=1 --ioengine=libaio --direct=1 --size=2M --bs=2M --rw=read
One 1280K command and one 768K command were sent instead of two 1M commands. It seems the new BIO_MAX_PAGES takes effect, and another factor is causing the command to split.
The splitting seems to be caused by the value of /sys/block/nvme0n1/queue/max_sectors_kb, which is 1280. After changing its value to 2048, one 2M command is sent.
I also tried increasing iodepth to 512 and size to 1G and ran it multiple times; it runs well.
Below is the description of max_sectors_kb in queue-sysfs.txt:
max_sectors_kb (RW)
-------------------
This is the maximum number of kilobytes that the block layer will allow
for a filesystem request. Must be smaller than or equal to the maximum
size allowed by the hardware.
It seems that BIO_MAX_PAGES and max_sectors_kb are two more factors that limit the maximum size of a transfer.
One thing that caught my attention is that max_sectors_kb is determined by BLK_DEF_MAX_SECTORS, which is defined as 2560 in blkdev.h. It does not accurately reflect the maximum size of a transfer, which is 1024KB for kernel 4.3 because BIO_MAX_PAGES is currently 256.
Based on the findings, I would propose the changes below.
1. Change BLK_DEF_MAX_SECTORS from 2560 to BIO_MAX_SECTORS (2048).
2. Previously users could change max_sectors_kb to any value smaller than or equal to max_hw_sectors_kb. Change the behavior so that users cannot set it to a value bigger than the minimum of the limits determined by max_hw_sectors_kb and BIO_MAX_SECTORS.
3. Update queue-sysfs.txt for item max_sectors_kb to also mention the limit caused by BIO_MAX_SECTORS.
4. Possibly add a configuration option for the kernel to support a BIO size of 2MB or more.
Any comments?
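The three observations above (two 1M commands, then 1280K + 768K, then a single 2M command) are consistent with a simple "smallest limit wins" model. The sketch below is only that simplified model, not the kernel's actual splitting code; effective_limit_kb and split_request are hypothetical names, and 4 KiB pages are assumed:

```python
def effective_limit_kb(bio_max_pages, max_sectors_kb, max_hw_sectors_kb,
                       page_size=4096):
    """Smallest of the three limits discussed in this thread, in KB."""
    bio_kb = bio_max_pages * page_size // 1024
    return min(bio_kb, max_sectors_kb, max_hw_sectors_kb)


def split_request(total_kb, limit_kb):
    """Cut a request of total_kb into chunks no larger than limit_kb."""
    chunks = []
    remaining = total_kb
    while remaining > 0:
        chunk = min(remaining, limit_kb)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# (BIO_MAX_PAGES, max_sectors_kb, max_hw_sectors_kb) for the three runs.
runs = [(256, 1280, 2048), (512, 1280, 2048), (512, 2048, 2048)]
for bio_pages, soft_kb, hw_kb in runs:
    limit = effective_limit_kb(bio_pages, soft_kb, hw_kb)
    print(split_request(2048, limit))
# [1024, 1024]
# [1280, 768]
# [2048]
```

In the real kernel the bio size and the queue limit are enforced at different layers, so the model can diverge from actual behavior for other request sizes; it is only meant to show why each limit had to be raised in turn.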
* Maximum NVMe IO command size > 1MB?
2016-01-06 19:51 ` Xuehua Chen
2016-01-06 21:56 ` Xuehua Chen
@ 2016-01-07 11:39 ` Sagi Grimberg
2016-01-07 17:34 ` Xuehua Chen
1 sibling, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2016-01-07 11:39 UTC (permalink / raw)
> The value is 2048, which seems to be 2MB.
I think 2048 is 1MB...
max_transfer_size = max_sectors_kb * sector_size = 2048K * 512 = 1MB
I don't think it's coming from any other limitation in the block layer since I'm able to transfer 8M and more in a single request with iser.
* Maximum NVMe IO command size > 1MB?
2016-01-07 11:39 ` Sagi Grimberg
@ 2016-01-07 17:34 ` Xuehua Chen
0 siblings, 0 replies; 9+ messages in thread
From: Xuehua Chen @ 2016-01-07 17:34 UTC (permalink / raw)
> I think 2048 is 1MB...
Could you double check?
From https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
max_hw_sectors_kb (RO)
----------------------
This is the maximum number of kilobytes supported in a single data transfer.
> I don't think it's coming from any other limitation in the block layer since I'm able to transfer 8M and more in a single request with iser.
Is this transfer done via a single bio? How did you determine the single transfer size?
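The units are easy to mix up, so here is the conversion spelled out. A sketch assuming 512-byte sectors and treating the sysfs *_kb attributes as kilobytes, as the queue-sysfs.txt text quoted above describes:

```python
SECTOR_SIZE = 512  # bytes per sector, assumed


def kb_to_sectors(kb):
    """Convert a *_kb sysfs value into a count of 512-byte sectors."""
    return kb * 1024 // SECTOR_SIZE


def sectors_to_kb(sectors):
    """Convert a count of 512-byte sectors into kilobytes."""
    return sectors * SECTOR_SIZE // 1024


print(kb_to_sectors(2048))  # 4096 sectors, i.e. 2MB
print(sectors_to_kb(2048))  # 1024 KB, i.e. 1MB
print(sectors_to_kb(2560))  # 1280 KB
```

So 2048 corresponds to 1MB only if it is read as a sector count; read as kilobytes, which is what the documentation says, it is 2MB. The last line shows how a constant of 2560 sectors maps onto a max_sectors_kb value of 1280.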
end of thread, other threads:[~2016-01-10 22:16 UTC | newest]
Thread overview: 9+ messages
2016-01-06 19:23 Maximum NVMe IO command size > 1MB? Xuehua Chen
2016-01-06 19:31 ` Keith Busch
2016-01-06 19:51   ` Xuehua Chen
2016-01-06 21:56     ` Xuehua Chen
2016-01-06 22:54       ` Keith Busch
2016-01-07 17:38         ` Xuehua Chen
2016-01-10 22:16         ` Xuehua Chen
2016-01-07 11:39     ` Sagi Grimberg
2016-01-07 17:34       ` Xuehua Chen