* [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers @ 2017-01-11 13:43 Johannes Thumshirn 2017-01-11 13:46 ` Hannes Reinecke ` (2 more replies) 0 siblings, 3 replies; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-11 13:43 UTC (permalink / raw)
To: lsf-pc@lists.linux-foundation.org
Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch

Hi all,

I'd like to attend LSF/MM and would like to discuss polling for block drivers.

Currently there is blk-iopoll, but it is not as widely used as NAPI is in the networking field, and according to Sagi's findings in [1] performance with polling is not on par with IRQ usage.

At LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in more block drivers and how to overcome the currently seen performance issues.

[1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html

Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 13:43 [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers Johannes Thumshirn @ 2017-01-11 13:46 ` Hannes Reinecke 2017-01-11 15:07 ` Jens Axboe 2017-01-11 16:08 ` Bart Van Assche 2 siblings, 0 replies; 50+ messages in thread From: Hannes Reinecke @ 2017-01-11 13:46 UTC (permalink / raw) To: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch On 01/11/2017 02:43 PM, Johannes Thumshirn wrote: > Hi all, > > I'd like to attend LSF/MM and would like to discuss polling for block drivers. > > Currently there is blk-iopoll but it is neither as widely used as NAPI in the > networking field and accoring to Sagi's findings in [1] performance with > polling is not on par with IRQ usage. > > On LSF/MM I'd like to whether it is desirable to have NAPI like polling in > more block drivers and how to overcome the currently seen performance issues. > > [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html > Yup. I'm all for it. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg GF: F. Imend�rffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG N�rnberg) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 13:43 [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers Johannes Thumshirn 2017-01-11 13:46 ` Hannes Reinecke @ 2017-01-11 15:07 ` Jens Axboe 2017-01-11 15:13 ` Jens Axboe ` (3 more replies) 2017-01-11 16:08 ` Bart Van Assche 2 siblings, 4 replies; 50+ messages in thread From: Jens Axboe @ 2017-01-11 15:07 UTC (permalink / raw) To: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch On 01/11/2017 06:43 AM, Johannes Thumshirn wrote: > Hi all, > > I'd like to attend LSF/MM and would like to discuss polling for block drivers. > > Currently there is blk-iopoll but it is neither as widely used as NAPI in the > networking field and accoring to Sagi's findings in [1] performance with > polling is not on par with IRQ usage. > > On LSF/MM I'd like to whether it is desirable to have NAPI like polling in > more block drivers and how to overcome the currently seen performance issues. It would be an interesting topic to discuss, as it is a shame that blk-iopoll isn't used more widely. -- Jens Axboe ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 15:07 ` Jens Axboe @ 2017-01-11 15:13 ` Jens Axboe 2017-01-12 8:23 ` Sagi Grimberg 2017-01-13 15:56 ` Johannes Thumshirn 2017-01-11 15:16 ` Hannes Reinecke ` (2 subsequent siblings) 3 siblings, 2 replies; 50+ messages in thread
From: Jens Axboe @ 2017-01-11 15:13 UTC (permalink / raw)
To: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org
Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch

On 01/11/2017 08:07 AM, Jens Axboe wrote:
> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote:
>> Hi all,
>>
>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>
>> Currently there is blk-iopoll but it is neither as widely used as NAPI in the
>> networking field and accoring to Sagi's findings in [1] performance with
>> polling is not on par with IRQ usage.
>>
>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
>> more block drivers and how to overcome the currently seen performance issues.
>
> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> isn't used more widely.

Forgot to mention - it should only be a topic if experimentation has been done and results gathered to pinpoint what the issues are, so we have something concrete to discuss. I'm not at all interested in a hand-wavy discussion on the topic.

--
Jens Axboe
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 15:13 ` Jens Axboe @ 2017-01-12 8:23 ` Sagi Grimberg 2017-01-12 10:02 ` Johannes Thumshirn 2017-01-17 15:38 ` Sagi Grimberg 2017-01-13 15:56 ` Johannes Thumshirn 1 sibling, 2 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-01-12 8:23 UTC (permalink / raw)
To: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org
Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch

>>> Hi all,
>>>
>>> I'd like to attend LSF/MM and would like to discuss polling for block drivers.
>>>
>>> Currently there is blk-iopoll but it is neither as widely used as NAPI in the
>>> networking field and accoring to Sagi's findings in [1] performance with
>>> polling is not on par with IRQ usage.
>>>
>>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
>>> more block drivers and how to overcome the currently seen performance issues.
>>
>> It would be an interesting topic to discuss, as it is a shame that blk-iopoll
>> isn't used more widely.
>
> Forgot to mention - it should only be a topic, if experimentation has
> been done and results gathered to pin point what the issues are, so we
> have something concrete to discus. I'm not at all interested in a hand
> wavy discussion on the topic.
>

Hey all,

Indeed I attempted to convert nvme to use irq-poll (let's use its new name) but experienced some unexplained performance degradations.

Keith reported a 700ns degradation for QD=1 with his Xpoint devices; this sort of degradation is acceptable I guess, because we do schedule a soft-irq before consuming the completion, but I noticed ~10% IOPs degradation for QD=32, which is not acceptable.

I agree with Jens that we'll need some analysis if we want the discussion to be effective, and I can spend some time on this if I can find volunteers with high-end nvme devices (I only have access to client nvme devices).

I can add debugfs statistics on the average number of completions I consume per interrupt, and I can also trace the interrupt and the soft-irq start/end. Any other interesting stats I can add?

I also tried a hybrid mode where the first 4 completions were handled in the interrupt and the rest in soft-irq, but that didn't make much of a difference.

Any other thoughts?
^ permalink raw reply [flat|nested] 50+ messages in thread
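For readers who have not looked at the irq-poll API, the conversion being discussed follows roughly the pattern below. This is only a minimal sketch with made-up names (struct foo_queue, foo_process_cq), not the actual nvme-irqpoll patch: the hard interrupt handler just masks the queue's vector and schedules the poll context, and the softirq callback reaps up to `budget` completions, re-enabling the interrupt once the queue is drained.

--
#include <linux/interrupt.h>
#include <linux/irq_poll.h>

/* hypothetical per-queue context, not the real struct nvme_queue */
struct foo_queue {
	struct irq_poll	iop;
	int		irq;
};

static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_queue *q = data;

	/* mask the line and defer completion reaping to softirq */
	disable_irq_nosync(q->irq);
	irq_poll_sched(&q->iop);
	return IRQ_HANDLED;
}

static int foo_irqpoll(struct irq_poll *iop, int budget)
{
	struct foo_queue *q = container_of(iop, struct foo_queue, iop);
	int done;

	/* foo_process_cq() stands in for reaping at most 'budget' CQEs */
	done = foo_process_cq(q, budget);
	if (done < budget) {
		/* queue drained: leave polling mode and unmask the IRQ */
		irq_poll_complete(iop);
		enable_irq(q->irq);
	}
	return done;
}

/* at queue init time, e.g. with a weight of 64 */
irq_poll_init(&q->iop, 64, foo_irqpoll);
--

The open question in this thread is essentially whether the extra softirq round-trip in this pattern gets amortised by reaping several completions per poll, or whether most interrupts carry only a single completion.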
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 8:23 ` Sagi Grimberg @ 2017-01-12 10:02 ` Johannes Thumshirn 2017-01-12 11:44 ` Sagi Grimberg 2017-01-17 15:38 ` Sagi Grimberg 1 sibling, 1 reply; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-12 10:02 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme, Linux-scsi

On Thu, Jan 12, 2017 at 10:23:47AM +0200, Sagi Grimberg wrote:
>
> >>>Hi all,
> >>>
> >>>I'd like to attend LSF/MM and would like to discuss polling for block drivers.
> >>>
> >>>Currently there is blk-iopoll but it is neither as widely used as NAPI in the
> >>>networking field and accoring to Sagi's findings in [1] performance with
> >>>polling is not on par with IRQ usage.
> >>>
> >>>On LSF/MM I'd like to whether it is desirable to have NAPI like polling in
> >>>more block drivers and how to overcome the currently seen performance issues.
> >>
> >>It would be an interesting topic to discuss, as it is a shame that blk-iopoll
> >>isn't used more widely.
> >
> >Forgot to mention - it should only be a topic, if experimentation has
> >been done and results gathered to pin point what the issues are, so we
> >have something concrete to discus. I'm not at all interested in a hand
> >wavy discussion on the topic.
> >
>
> Hey all,
>
> Indeed I attempted to convert nvme to use irq-poll (let's use its
> new name) but experienced some unexplained performance degradations.
>
> Keith reported a 700ns degradation for QD=1 with his Xpoint devices,
> this sort of degradation are acceptable I guess because we do schedule
> a soft-irq before consuming the completion, but I noticed ~10% IOPs
> degradation fr QD=32 which is not acceptable.
>
> I agree with Jens that we'll need some analysis if we want the
> discussion to be affective, and I can spend some time this if I
> can find volunteers with high-end nvme devices (I only have access
> to client nvme devices.

I have a P3700 but somehow burned the FW. Let me see if I can bring it back to life.

I also have converted AHCI to the irq_poll interface and will run some tests. I do also have some hpsa devices on which I could run tests once the driver is adapted.

But can we come to a common testing methodology so we don't compare apples with oranges? Sagi, do you still have the fio job file from your last tests lying around somewhere, and if yes could you share it?

Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 10:02 ` Johannes Thumshirn @ 2017-01-12 11:44 ` Sagi Grimberg 2017-01-12 12:53 ` Johannes Thumshirn 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-12 11:44 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme, Linux-scsi >> I agree with Jens that we'll need some analysis if we want the >> discussion to be affective, and I can spend some time this if I >> can find volunteers with high-end nvme devices (I only have access >> to client nvme devices. > > I have a P3700 but somehow burned the FW. Let me see if I can bring it back to > live. > > I also have converted AHCI to the irq_poll interface and will run some tests. > I do also have some hpsa devices on which I could run tests once the driver is > adopted. > > But can we come to a common testing methology not to compare apples with > oranges? Sagi do you still have the fio job file from your last tests laying > somewhere and if yes could you share it? Its pretty basic: -- [global] group_reporting cpus_allowed=0 cpus_allowed_policy=split rw=randrw bs=4k numjobs=4 iodepth=32 runtime=60 time_based loops=1 ioengine=libaio direct=1 invalidate=1 randrepeat=1 norandommap exitall [job] -- **Note: when I ran multiple threads on more cpus the performance degradation phenomenon disappeared, but I tested on a VM with qemu emulation backed by null_blk so I figured I had some other bottleneck somewhere (that's why I asked for some more testing). Note that I ran randrw because I was backed with null_blk, testing with a real nvme device, you should either run randread or write, and if you do a write, you can't run it multi-threaded (well you can, but you'll get unpredictable performance...). ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 11:44 ` Sagi Grimberg @ 2017-01-12 12:53 ` Johannes Thumshirn 2017-01-12 14:41 ` [Lsf-pc] " Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread From: Johannes Thumshirn @ 2017-01-12 12:53 UTC (permalink / raw) To: Sagi Grimberg Cc: Jens Axboe, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme, Linux-scsi On Thu, Jan 12, 2017 at 01:44:05PM +0200, Sagi Grimberg wrote: [...] > Its pretty basic: > -- > [global] > group_reporting > cpus_allowed=0 > cpus_allowed_policy=split > rw=randrw > bs=4k > numjobs=4 > iodepth=32 > runtime=60 > time_based > loops=1 > ioengine=libaio > direct=1 > invalidate=1 > randrepeat=1 > norandommap > exitall > > [job] > -- > > **Note: when I ran multiple threads on more cpus the performance > degradation phenomenon disappeared, but I tested on a VM with > qemu emulation backed by null_blk so I figured I had some other > bottleneck somewhere (that's why I asked for some more testing). That could be because of the vmexits as every MMIO access in the guest triggers a vmexit and if you poll with a low budget you do more MMIOs hence you have more vmexits. Did you do testing only in qemu or with real H/W as well? > > Note that I ran randrw because I was backed with null_blk, testing > with a real nvme device, you should either run randread or write, and > if you do a write, you can't run it multi-threaded (well you can, but > you'll get unpredictable performance...). Noted, thanks. Byte, Johannes -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg GF: Felix Imend�rffer, Jane Smithard, Graham Norton HRB 21284 (AG N�rnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 12:53 ` Johannes Thumshirn @ 2017-01-12 14:41 ` Sagi Grimberg 2017-01-12 18:59 ` Johannes Thumshirn 0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2017-01-12 14:41 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Keith Busch, Linux-scsi, Christoph Hellwig, linux-nvme, linux-block, lsf-pc@lists.linux-foundation.org

>> **Note: when I ran multiple threads on more cpus the performance
>> degradation phenomenon disappeared, but I tested on a VM with
>> qemu emulation backed by null_blk so I figured I had some other
>> bottleneck somewhere (that's why I asked for some more testing).
>
> That could be because of the vmexits as every MMIO access in the guest
> triggers a vmexit and if you poll with a low budget you do more MMIOs hence
> you have more vmexits.
>
> Did you do testing only in qemu or with real H/W as well?

I tried once. IIRC, I saw the same phenomena...
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 14:41 ` [Lsf-pc] " Sagi Grimberg @ 2017-01-12 18:59 ` Johannes Thumshirn 0 siblings, 0 replies; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-12 18:59 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, Keith Busch, Linux-scsi, Christoph Hellwig, linux-nvme, linux-block, lsf-pc@lists.linux-foundation.org

On Thu, Jan 12, 2017 at 04:41:00PM +0200, Sagi Grimberg wrote:
>
> >>**Note: when I ran multiple threads on more cpus the performance
> >>degradation phenomenon disappeared, but I tested on a VM with
> >>qemu emulation backed by null_blk so I figured I had some other
> >>bottleneck somewhere (that's why I asked for some more testing).
> >
> >That could be because of the vmexits as every MMIO access in the guest
> >triggers a vmexit and if you poll with a low budget you do more MMIOs hence
> >you have more vmexits.
> >
> >Did you do testing only in qemu or with real H/W as well?
>
> I tried once. IIRC, I saw the same phenomenons...

JFTR I tried my AHCI irq_poll patch on the Qemu emulation and the read throughput dropped from ~1GB/s to ~350MB/s. But this can be related to Qemu's I/O weirdness as well, I think. I'll try on real hardware tomorrow.

--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 8:23 ` Sagi Grimberg 2017-01-12 10:02 ` Johannes Thumshirn @ 2017-01-17 15:38 ` Sagi Grimberg 2017-01-17 15:45 ` Sagi Grimberg ` (2 more replies) 1 sibling, 3 replies; 50+ messages in thread From: Sagi Grimberg @ 2017-01-17 15:38 UTC (permalink / raw) To: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch [-- Attachment #1: Type: text/plain, Size: 5615 bytes --] Hey, so I made some initial analysis of whats going on with irq-poll. First, I sampled how much time it takes before we get the interrupt in nvme_irq and the initial visit to nvme_irqpoll_handler. I ran a single threaded fio with QD=32 of 4K reads. This is two displays of a histogram of the latency (ns): -- [1] queue = b'nvme0q1' usecs : count distribution 0 -> 1 : 7310 |****************************************| 2 -> 3 : 11 | | 4 -> 7 : 10 | | 8 -> 15 : 20 | | 16 -> 31 : 0 | | 32 -> 63 : 0 | | 64 -> 127 : 1 | | [2] queue = b'nvme0q1' usecs : count distribution 0 -> 1 : 7309 |****************************************| 2 -> 3 : 14 | | 4 -> 7 : 7 | | 8 -> 15 : 17 | | We can see that most of the time our latency is pretty good (<1ns) but with huge tail latencies (some 8-15 ns and even one in 32-63 ns). **NOTE, in order to reduce the tracing impact on performance I sampled for every 100 interrupts. I also sampled for a multiple threads/queues with QD=32 of 4K reads. This is a collection of histograms for 5 queues (5 fio threads): queue = b'nvme0q1' usecs : count distribution 0 -> 1 : 701 |****************************************| 2 -> 3 : 177 |********** | 4 -> 7 : 56 |*** | 8 -> 15 : 24 |* | 16 -> 31 : 6 | | 32 -> 63 : 1 | | queue = b'nvme0q2' usecs : count distribution 0 -> 1 : 412 |****************************************| 2 -> 3 : 52 |***** | 4 -> 7 : 19 |* | 8 -> 15 : 13 |* | 16 -> 31 : 5 | | queue = b'nvme0q3' usecs : count distribution 0 -> 1 : 381 |****************************************| 2 -> 3 : 74 |******* | 4 -> 7 : 26 |** | 8 -> 15 : 12 |* | 16 -> 31 : 3 | | 32 -> 63 : 0 | | 64 -> 127 : 0 | | 128 -> 255 : 1 | | queue = b'nvme0q4' usecs : count distribution 0 -> 1 : 386 |****************************************| 2 -> 3 : 63 |****** | 4 -> 7 : 30 |*** | 8 -> 15 : 11 |* | 16 -> 31 : 7 | | 32 -> 63 : 1 | | queue = b'nvme0q5' usecs : count distribution 0 -> 1 : 384 |****************************************| 2 -> 3 : 69 |******* | 4 -> 7 : 25 |** | 8 -> 15 : 15 |* | 16 -> 31 : 3 | | Overall looks pretty much the same but some more samples with tails... Next, I sampled how many completions we are able to consume per interrupt. Two exaples of histograms of how many completions we take per interrupt. -- queue = b'nvme0q1' completed : count distribution 0 : 0 | | 1 : 11690 |****************************************| 2 : 46 | | 3 : 1 | | queue = b'nvme0q1' completed : count distribution 0 : 0 | | 1 : 944 |****************************************| 2 : 8 | | -- So it looks like we are super not efficient because most of the times we catch 1 completion per interrupt and the whole point is that we need to find more! This fio is single threaded with QD=32 so I'd expect that we be somewhere in 8-31 almost all the time... I also tried QD=1024, histogram is still the same. **NOTE: Here I also sampled for every 100 interrupts. I'll try to run the counter on the current nvme driver and see what I get. 
I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps) with hope that someone is interested enough to try and reproduce these numbers on his/hers setup and maybe suggest some other useful tracing we can do. Prerequisites: 1. iovisor is needed for python bpf support. $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list $ sudo apt-get update -y $ sudo apt-get install bcc-tools -y # Nastty hack .. bcc only available in python2 but copliant with python3.. $ sudo cp -r /usr/lib/python2.7/dist-packages/bcc /usr/lib/python3/dist-packages/ 2. Because we don't have the nvme-pci symbols exported, The nvme.h file is needed on the test machine (where the bpf code is running). I used nfs mount for the linux source (this is why I include from /mnt/linux in the scripts). [-- Attachment #2: nvme-count-comps --] [-- Type: text/plain, Size: 6398 bytes --] #!/usr/bin/python3 # @lint-avoid-python-3-compatibility-imports from __future__ import print_function from bcc import BPF from time import sleep, strftime import argparse # arguments examples = """examples: ./nvme_comp_cout # summarize interrupt->irqpoll latency as a histogram ./nvme_comp_cout 1 10 # print 1 second summaries, 10 times ./nvme_comp_cout -mT 1 # 1s summaries, milliseconds, and timestamps ./nvme_comp_cout -Q # show each nvme queue device separately """ parser = argparse.ArgumentParser( description="Summarize block device I/O latency as a histogram", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=examples) parser.add_argument("-T", "--timestamp", action="store_true", help="include timestamp on output") parser.add_argument("-m", "--milliseconds", action="store_true", help="millisecond histogram") parser.add_argument("-Q", "--queues", action="store_true", help="print a histogram per queue") parser.add_argument("--freq", help="Account every N-th request", type=int, required=False) parser.add_argument("interval", nargs="?", default=2, help="output interval, in seconds") parser.add_argument("count", nargs="?", default=99999999, help="number of outputs") args = parser.parse_args() countdown = int(args.count) debug = 0 # define BPF program bpf_text = """ #include <uapi/linux/ptrace.h> /***************************************************************** * Nasty hack because we don't have the nvme-pci structs exported *****************************************************************/ #include <linux/aer.h> #include <linux/bitops.h> #include <linux/blkdev.h> #include <linux/blk-mq.h> #include <linux/blk-mq-pci.h> #include <linux/cpu.h> #include <linux/delay.h> #include <linux/errno.h> #include <linux/fs.h> #include <linux/genhd.h> #include <linux/hdreg.h> #include <linux/idr.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/io.h> #include <linux/kdev_t.h> #include <linux/kernel.h> #include <linux/mm.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/mutex.h> #include <linux/pci.h> #include <linux/poison.h> #include <linux/ptrace.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/t10-pi.h> #include <linux/timer.h> #include <linux/types.h> #include <linux/io-64-nonatomic-lo-hi.h> #include <asm/unaligned.h> #include <linux/irq_poll.h> #include "/mnt/linux/drivers/nvme/host/nvme.h" struct nvme_dev; struct nvme_queue; /* * Represents an NVM Express device. Each nvme_dev is a PCI function. 
*/ struct nvme_dev { struct nvme_queue **queues; struct blk_mq_tag_set tagset; struct blk_mq_tag_set admin_tagset; u32 __iomem *dbs; struct device *dev; struct dma_pool *prp_page_pool; struct dma_pool *prp_small_pool; unsigned queue_count; unsigned online_queues; unsigned max_qid; int q_depth; u32 db_stride; void __iomem *bar; struct work_struct reset_work; struct work_struct remove_work; struct timer_list watchdog_timer; struct mutex shutdown_lock; bool subsystem; void __iomem *cmb; dma_addr_t cmb_dma_addr; u64 cmb_size; u32 cmbsz; u32 cmbloc; struct nvme_ctrl ctrl; struct completion ioq_wait; }; /* * An NVM Express queue. Each device has at least two (one for admin * commands and one for I/O commands). */ struct nvme_queue { struct device *q_dmadev; struct nvme_dev *dev; char irqname[24]; spinlock_t sq_lock; spinlock_t cq_lock; struct nvme_command *sq_cmds; struct nvme_command __iomem *sq_cmds_io; volatile struct nvme_completion *cqes; struct blk_mq_tags **tags; dma_addr_t sq_dma_addr; dma_addr_t cq_dma_addr; u32 __iomem *q_db; u16 q_depth; s16 cq_vector; u16 sq_tail; u16 cq_head; u16 qid; u8 cq_phase; struct irq_poll iop; }; typedef struct queue_key { char queue[24]; u64 slot; } queue_key_t; /* Completion counter context */ struct nvmeq { struct nvme_queue *q; u64 completed; }; BPF_TABLE("percpu_array", int, struct nvmeq, qarr, 1); BPF_TABLE("percpu_array", int, int, call_count, 1); STORAGE /* trace nvme interrupt */ int trace_interrupt_start(struct pt_regs *ctx, int irq, void *data) { __CALL__COUNT__FILTER__ struct nvmeq q ={}; int index = 0; q.q = data; q.completed = 0; /* reset completions */ qarr.update(&index, &q); return 0; } /* count completed on each irqpoll end */ int trace_irqpoll_end(struct pt_regs *ctx) { __CALL__COUNT__FILTER__ struct nvmeq zero = {}; int index = 0; struct nvmeq *q; int completed = ctx->ax; q = qarr.lookup_or_init(&index, &zero); if (q == NULL) { goto out; } q->completed += completed; /* No variables in kretprobe :( 64 is our budget */ if (completed < 64) { /* store as histogram */ STORE q->completed = 0; } out: return 0; } """ call_count_filter = """ { int zero = 0; int index =0; int *skip; skip = call_count.lookup_or_init(&index, &zero); if ((*skip) < %d) { (*skip)++; return 0; } (*skip) = 0; } """ # code substitutions if args.queues: bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist, queue_key_t);') bpf_text = bpf_text.replace('STORE', 'queue_key_t key = {.slot = q->completed}; ' + 'bpf_probe_read(&key.queue, sizeof(key.queue), ' + 'q->q->irqname); dist.increment(key);') else: bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);') bpf_text = bpf_text.replace('STORE', 'dist.increment(q->completed);') bpf_text = bpf_text.replace("__CALL__COUNT__FILTER__", call_count_filter % (args.freq - 1) if args.freq is not None else "") if debug: print(bpf_text) # load BPF program b = BPF(text=bpf_text) b.attach_kprobe(event="nvme_irq", fn_name="trace_interrupt_start") b.attach_kretprobe(event="nvme_irqpoll_handler", fn_name="trace_irqpoll_end") print("Tracing nvme I/O interrupt/irqpoll... 
Hit Ctrl-C to end.") # output exiting = 0 if args.interval else 1 dist = b.get_table("dist") while (1): try: sleep(int(args.interval)) except KeyboardInterrupt: exiting = 1 print() if args.timestamp: print("%-8s\n" % strftime("%H:%M:%S"), end="") dist.print_linear_hist("completed", "queue") dist.clear() countdown -= 1 if exiting or countdown == 0: exit() [-- Attachment #3: nvme-trace-irq --] [-- Type: text/plain, Size: 6397 bytes --] #!/usr/bin/python3 # @lint-avoid-python-3-compatibility-imports from __future__ import print_function from bcc import BPF from time import sleep, strftime import argparse # arguments examples = """examples: ./nvmetrace # summarize interrupt->irqpoll latency as a histogram ./nvmetrace 1 10 # print 1 second summaries, 10 times ./nvmetrace -mT 1 # 1s summaries, milliseconds, and timestamps ./nvmetrace -Q # show each nvme queue device separately """ parser = argparse.ArgumentParser( description="Summarize interrupt to softirq latency as a histogram", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=examples) parser.add_argument("-T", "--timestamp", action="store_true", help="include timestamp on output") parser.add_argument("-m", "--milliseconds", action="store_true", help="millisecond histogram") parser.add_argument("-Q", "--queues", action="store_true", help="print a histogram per queue") parser.add_argument("--freq", help="Account every N-th request", type=int, required=False) parser.add_argument("interval", nargs="?", default=2, help="output interval, in seconds") parser.add_argument("count", nargs="?", default=99999999, help="number of outputs") args = parser.parse_args() countdown = int(args.count) debug = 0 # define BPF program bpf_text = """ #include <uapi/linux/ptrace.h> /***************************************************************** * Nasty hack because we don't have the nvme-pci structs exported *****************************************************************/ #include <linux/aer.h> #include <linux/bitops.h> #include <linux/blkdev.h> #include <linux/blk-mq.h> #include <linux/blk-mq-pci.h> #include <linux/cpu.h> #include <linux/delay.h> #include <linux/errno.h> #include <linux/fs.h> #include <linux/genhd.h> #include <linux/hdreg.h> #include <linux/idr.h> #include <linux/init.h> #include <linux/interrupt.h> #include <linux/io.h> #include <linux/kdev_t.h> #include <linux/kernel.h> #include <linux/mm.h> #include <linux/module.h> #include <linux/moduleparam.h> #include <linux/mutex.h> #include <linux/pci.h> #include <linux/poison.h> #include <linux/ptrace.h> #include <linux/sched.h> #include <linux/slab.h> #include <linux/t10-pi.h> #include <linux/timer.h> #include <linux/types.h> #include <linux/io-64-nonatomic-lo-hi.h> #include <asm/unaligned.h> #include <linux/irq_poll.h> /* location of nvme.h */ #include "/mnt/linux/drivers/nvme/host/nvme.h" struct nvme_dev; struct nvme_queue; /* * Represents an NVM Express device. Each nvme_dev is a PCI function. 
*/ struct nvme_dev { struct nvme_queue **queues; struct blk_mq_tag_set tagset; struct blk_mq_tag_set admin_tagset; u32 __iomem *dbs; struct device *dev; struct dma_pool *prp_page_pool; struct dma_pool *prp_small_pool; unsigned queue_count; unsigned online_queues; unsigned max_qid; int q_depth; u32 db_stride; void __iomem *bar; struct work_struct reset_work; struct work_struct remove_work; struct timer_list watchdog_timer; struct mutex shutdown_lock; bool subsystem; void __iomem *cmb; dma_addr_t cmb_dma_addr; u64 cmb_size; u32 cmbsz; u32 cmbloc; struct nvme_ctrl ctrl; struct completion ioq_wait; }; /* * An NVM Express queue. Each device has at least two (one for admin * commands and one for I/O commands). */ struct nvme_queue { struct device *q_dmadev; struct nvme_dev *dev; char irqname[24]; spinlock_t sq_lock; spinlock_t cq_lock; struct nvme_command *sq_cmds; struct nvme_command __iomem *sq_cmds_io; volatile struct nvme_completion *cqes; struct blk_mq_tags **tags; dma_addr_t sq_dma_addr; dma_addr_t cq_dma_addr; u32 __iomem *q_db; u16 q_depth; s16 cq_vector; u16 sq_tail; u16 cq_head; u16 qid; u8 cq_phase; struct irq_poll iop; }; typedef struct queue_key { char queue[24]; u64 slot; } queue_key_t; BPF_HASH(start, struct nvme_queue *); BPF_TABLE("percpu_array", int, int, call_count, 1); STORAGE /* timestamp nvme interrupt */ int trace_interrupt_start(struct pt_regs *ctx, int irq, void *data) { __CALL__COUNT__FILTER__ struct nvme_queue *q = data; u64 ts = bpf_ktime_get_ns(); start.update(&q, &ts); return 0; } /* timestamp nvme irqpoll */ int trace_irqpoll_start(struct pt_regs *ctx, struct irq_poll *iop, int budget) { struct nvme_queue *q = container_of(iop, struct nvme_queue, iop); u64 *tsp, delta; /* fetch timestamp and calculate delta */ tsp = start.lookup(&q); if (tsp == 0) { return 0; /* missed issue */ } delta = bpf_ktime_get_ns() - *tsp; FACTOR /* store as histogram */ STORE start.delete(&q); return 0; } """ # code substitutions call_count_filter = """ { int zero = 0; int index =0; int *skip; skip = call_count.lookup_or_init(&index, &zero); if ((*skip) < %d) { (*skip)++; return 0; } (*skip) = 0; } """ if args.milliseconds: bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;') label = "msecs" else: bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;') label = "usecs" if args.queues: bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist, queue_key_t);') bpf_text = bpf_text.replace('STORE', 'queue_key_t key = {.slot = bpf_log2l(delta)}; ' + 'bpf_probe_read(&key.queue, sizeof(key.queue), ' + 'q->irqname); dist.increment(key);') else: bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);') bpf_text = bpf_text.replace('STORE', 'dist.increment(bpf_log2l(delta));') bpf_text = bpf_text.replace("__CALL__COUNT__FILTER__", call_count_filter % (args.freq - 1) if args.freq is not None else "") if debug: print(bpf_text) # load BPF program b = BPF(text=bpf_text) b.attach_kprobe(event="nvme_irq", fn_name="trace_interrupt_start") b.attach_kprobe(event="nvme_irqpoll_handler", fn_name="trace_irqpoll_start") print("Tracing nvme I/O interrupt/irqpoll... Hit Ctrl-C to end.") # output exiting = 0 if args.interval else 1 dist = b.get_table("dist") while (1): try: sleep(int(args.interval)) except KeyboardInterrupt: exiting = 1 print() if args.timestamp: print("%-8s\n" % strftime("%H:%M:%S"), end="") dist.print_log2_hist(label, "queue") dist.clear() countdown -= 1 if exiting or countdown == 0: exit() ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 15:38 ` Sagi Grimberg @ 2017-01-17 15:45 ` Sagi Grimberg 2017-01-20 12:22 ` Johannes Thumshirn 2017-01-17 16:15 ` Sagi Grimberg 2017-01-17 16:44 ` Andrey Kuzmin 2 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-17 15:45 UTC (permalink / raw) To: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch > -- > [1] > queue = b'nvme0q1' > usecs : count distribution > 0 -> 1 : 7310 |****************************************| > 2 -> 3 : 11 | | > 4 -> 7 : 10 | | > 8 -> 15 : 20 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 1 | | > > [2] > queue = b'nvme0q1' > usecs : count distribution > 0 -> 1 : 7309 |****************************************| > 2 -> 3 : 14 | | > 4 -> 7 : 7 | | > 8 -> 15 : 17 | | > Rrr, email made the histograms look funky (tabs vs. spaces...) The count is what's important anyways... Just adding that I used an Intel P3500 nvme device. > We can see that most of the time our latency is pretty good (<1ns) but with > huge tail latencies (some 8-15 ns and even one in 32-63 ns). Obviously is micro-seconds and not nano-seconds (I wish...) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 15:45 ` Sagi Grimberg @ 2017-01-20 12:22 ` Johannes Thumshirn 0 siblings, 0 replies; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-20 12:22 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, lsf-pc@lists.linux-foundation.org, linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch

On Tue, Jan 17, 2017 at 05:45:53PM +0200, Sagi Grimberg wrote:
>
> >--
> >[1]
> >queue = b'nvme0q1'
> > usecs : count distribution
> > 0 -> 1 : 7310 |****************************************|
> > 2 -> 3 : 11 | |
> > 4 -> 7 : 10 | |
> > 8 -> 15 : 20 | |
> > 16 -> 31 : 0 | |
> > 32 -> 63 : 0 | |
> > 64 -> 127 : 1 | |
> >
> >[2]
> >queue = b'nvme0q1'
> > usecs : count distribution
> > 0 -> 1 : 7309 |****************************************|
> > 2 -> 3 : 14 | |
> > 4 -> 7 : 7 | |
> > 8 -> 15 : 17 | |
> >
>
> Rrr, email made the histograms look funky (tabs vs. spaces...)
> The count is what's important anyways...
>
> Just adding that I used an Intel P3500 nvme device.
>
> >We can see that most of the time our latency is pretty good (<1ns) but with
> >huge tail latencies (some 8-15 ns and even one in 32-63 ns).
>
> Obviously is micro-seconds and not nano-seconds (I wish...)

So to share yesterday's (and today's) findings: on AHCI I see only one completion polled as well.

This probably is because, in contrast to networking (with NAPI), in the block layer we do have a link between submission and completion, whereas in networking RX and TX are decoupled. So if we're sending out one request we get the completion for it. What we'd need is a link to know "we've sent 10 requests out, now poll for the 10 completions after the 1st IRQ". So basically what NVMe already did with calling __nvme_process_cq() after submission.

Maybe we should even disable IRQs when submitting and re-enable after submitting so the submission path doesn't get preempted by a completion.

Does this make sense?

Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
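One way to express the submission/completion linkage described above is to count in-flight commands per queue and only leave polling mode once they have all been reaped. The fragment below is a sketch only, with made-up names (foo_process_cq, the outstanding counter); none of this is in a posted patch. Note that returning less than the budget without calling irq_poll_complete() keeps irq-poll calling the handler back, so this effectively busy-polls in the softirq (bounded by irq-poll's per-run budget and time slice) until the queue drains, which may or may not be a win and would need measuring.

--
/*
 * Hypothetical sketch: track in-flight commands per queue and keep the
 * queue in polling mode until they have all completed.
 */
struct foo_queue {
	struct irq_poll	iop;
	atomic_t	outstanding;
	int		irq;
};

/* submission path, just before ringing the doorbell */
atomic_inc(&q->outstanding);
foo_submit_cmd(q, cmd);

/* poll callback */
static int foo_irqpoll(struct irq_poll *iop, int budget)
{
	struct foo_queue *q = container_of(iop, struct foo_queue, iop);
	int done = foo_process_cq(q, budget);	/* reap up to 'budget' */

	atomic_sub(done, &q->outstanding);
	/* only leave polling mode once nothing is known to be in flight */
	if (done < budget && !atomic_read(&q->outstanding)) {
		irq_poll_complete(iop);
		enable_irq(q->irq);
	}
	return done;
}
--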
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 15:38 ` Sagi Grimberg 2017-01-17 15:45 ` Sagi Grimberg @ 2017-01-17 16:15 ` Sagi Grimberg 2017-01-17 16:27 ` Johannes Thumshirn 2017-01-17 16:44 ` Andrey Kuzmin 2 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-17 16:15 UTC (permalink / raw) To: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org Cc: linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch Oh, and the current code that was tested can be found at: git://git.infradead.org/nvme.git nvme-irqpoll ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 16:15 ` Sagi Grimberg @ 2017-01-17 16:27 ` Johannes Thumshirn 2017-01-17 16:38 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-17 16:27 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, lsf-pc@lists.linux-foundation.org, linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch

On Tue, Jan 17, 2017 at 06:15:43PM +0200, Sagi Grimberg wrote:
> Oh, and the current code that was tested can be found at:
>
> git://git.infradead.org/nvme.git nvme-irqpoll

Just for the record, all tests you've run are with the upper irq_poll_budget of 256 [1]?

We (Hannes and me) recently stumbed across this when trying to poll for more than 256 queue entries in the drivers we've been testing.

Did your system load reduce with irq polling? In theory it should, but I have seen increases with AHCI at least according to fio. IIRC Hannes saw decreases with his SAS HBA tests, as expected.

[1] lib/irq_poll.c:13

Byte,
Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 16:27 ` Johannes Thumshirn @ 2017-01-17 16:38 ` Sagi Grimberg 2017-01-18 13:51 ` Johannes Thumshirn 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-17 16:38 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, lsf-pc@lists.linux-foundation.org, linux-block, Linux-scsi, linux-nvme, Christoph Hellwig, Keith Busch > Just for the record, all tests you've run are with the upper irq_poll_budget of > 256 [1]? Yes, but that's the point, I never ever reach this budget because I'm only processing 1-2 completions per interrupt. > We (Hannes and me) recently stumbed accross this when trying to poll for more > than 256 queue entries in the drivers we've been testing. What do you mean by stumbed? irq-poll should be agnostic to the fact that drivers can poll more than their given budget? > Did your system load reduce with irq polling? In theory it should but I have > seen increases with AHCI at least according to fio. IIRC Hannes saw decreases > with his SAS HBA tests, as expected. I didn't see any reduction. When I tested on a single cpu core (to simplify for a single queue) the cpu was at 100% cpu but got less iops (which makes sense, a single cpu-core is not enough to max out the nvme device, at least not the core I'm using). Before irqpoll I got ~230 KIOPs on a single cpu-core and after irqpoll I got ~205 KIOPs which is consistent with the ~10% iops decrease I've reported in the original submission. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 16:38 ` Sagi Grimberg @ 2017-01-18 13:51 ` Johannes Thumshirn 2017-01-18 14:27 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-18 13:51 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

On Tue, Jan 17, 2017 at 06:38:43PM +0200, Sagi Grimberg wrote:
>
> >Just for the record, all tests you've run are with the upper irq_poll_budget of
> >256 [1]?
>
> Yes, but that's the point, I never ever reach this budget because
> I'm only processing 1-2 completions per interrupt.
>
> >We (Hannes and me) recently stumbed accross this when trying to poll for more
> >than 256 queue entries in the drivers we've been testing.

s/stumbed/stumbled/

>
> What do you mean by stumbed? irq-poll should be agnostic to the fact
> that drivers can poll more than their given budget?

So what you say is you saw a consumed == 1 [1] most of the time?

[1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 13:51 ` Johannes Thumshirn @ 2017-01-18 14:27 ` Sagi Grimberg 2017-01-18 14:36 ` Andrey Kuzmin 2017-01-18 14:58 ` Johannes Thumshirn 1 sibling, 2 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-01-18 14:27 UTC (permalink / raw)
To: Johannes Thumshirn
Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

> So what you say is you saw a consomed == 1 [1] most of the time?
>
> [1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836

Exactly. Processing just 1 completion per interrupt makes it perfectly clear why this performs poorly: it's not worth paying the soft-irq schedule for only a single completion.

What I'm curious about is how consistent this is across different devices (wish I had some...)
^ permalink raw reply [flat|nested] 50+ messages in thread
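As a rough sanity check of why a single completion per interrupt hurts (ballpark numbers only, and mixing figures from different setups reported in this thread): at ~230 KIOPs on one saturated core the average per-IO budget is about 1/230000 s, roughly 4.3 us. If scheduling and running the softirq adds something on the order of the ~0.7 us seen at QD=1, that is about 15% extra work per completion whenever every interrupt carries only one completion, which is in the same ballpark as the ~10% IOPs drop reported earlier. The softirq overhead only amortises once several completions are reaped per poll.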
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 14:27 ` Sagi Grimberg @ 2017-01-18 14:36 ` Andrey Kuzmin 2017-01-18 14:40 ` Sagi Grimberg 2017-01-18 14:58 ` Johannes Thumshirn 1 sibling, 1 reply; 50+ messages in thread From: Andrey Kuzmin @ 2017-01-18 14:36 UTC (permalink / raw) To: Sagi Grimberg Cc: Johannes Thumshirn, Jens Axboe, Keith Busch, Linux-scsi, linux-nvme@lists.infradead.org, Christoph Hellwig, linux-block, lsf-pc@lists.linux-foundation.org On Wed, Jan 18, 2017 at 5:27 PM, Sagi Grimberg <sagi@grimberg.me> wrote: > >> So what you say is you saw a consomed == 1 [1] most of the time? >> >> [1] from >> http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836 > > > Exactly. By processing 1 completion per interrupt it makes perfect sense > why this performs poorly, it's not worth paying the soft-irq schedule > for only a single completion. Your report provided this stats with one-completion dominance for the single-threaded case. Does it also hold if you run multiple fio threads per core? Regards, Andrey > > What I'm curious is how consistent is this with different devices (wish > I had some...) > > > _______________________________________________ > Linux-nvme mailing list > Linux-nvme@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 14:36 ` Andrey Kuzmin @ 2017-01-18 14:40 ` Sagi Grimberg 2017-01-18 15:35 ` Andrey Kuzmin 0 siblings, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2017-01-18 14:40 UTC (permalink / raw)
To: Andrey Kuzmin
Cc: Johannes Thumshirn, Jens Axboe, Keith Busch, Linux-scsi, linux-nvme@lists.infradead.org, Christoph Hellwig, linux-block, lsf-pc@lists.linux-foundation.org

> Your report provided this stats with one-completion dominance for the
> single-threaded case. Does it also hold if you run multiple fio
> threads per core?

It's useless to run more threads on that core, it's already fully utilized. That single thread is already posting a fair amount of submissions, so I don't see how adding more fio jobs can help in any way.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 14:40 ` Sagi Grimberg @ 2017-01-18 15:35 ` Andrey Kuzmin 0 siblings, 0 replies; 50+ messages in thread
From: Andrey Kuzmin @ 2017-01-18 15:35 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Johannes Thumshirn, Jens Axboe, Keith Busch, Linux-scsi, linux-nvme@lists.infradead.org, Christoph Hellwig, linux-block, lsf-pc@lists.linux-foundation.org

On Wed, Jan 18, 2017 at 5:40 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
>> Your report provided this stats with one-completion dominance for the
>> single-threaded case. Does it also hold if you run multiple fio
>> threads per core?
>
>
> It's useless to run more threads on that core, it's already fully
> utilized. That single threads is already posting a fair amount of
> submissions, so I don't see how adding more fio jobs can help in any
> way.

With a single thread, your completion processing/submission is completely serialized. From my experience, it takes fio a couple of microseconds to process a completion and submit the next request, and that's (much) larger than your interrupt processing time.

Regards,
Andrey
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 14:27 ` Sagi Grimberg 2017-01-18 14:36 ` Andrey Kuzmin @ 2017-01-18 14:58 ` Johannes Thumshirn 2017-01-18 15:14 ` Sagi Grimberg 1 sibling, 1 reply; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-18 14:58 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

On Wed, Jan 18, 2017 at 04:27:24PM +0200, Sagi Grimberg wrote:
>
> >So what you say is you saw a consomed == 1 [1] most of the time?
> >
> >[1] from http://git.infradead.org/nvme.git/commitdiff/eed5a9d925c59e43980047059fde29e3aa0b7836
>
> Exactly. By processing 1 completion per interrupt it makes perfect sense
> why this performs poorly, it's not worth paying the soft-irq schedule
> for only a single completion.
>
> What I'm curious is how consistent is this with different devices (wish
> I had some...)

Hannes just spotted this:

static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
[...]
	__nvme_submit_cmd(nvmeq, &cmnd);
	nvme_process_cq(nvmeq);
	spin_unlock_irq(&nvmeq->q_lock);
	return BLK_MQ_RQ_QUEUE_OK;
out_cleanup_iod:
	nvme_free_iod(dev, req);
out_free_cmd:
	nvme_cleanup_cmd(req);
	return ret;
}

So we're draining the CQ on submit. This of course makes polling for completions in the IRQ handler rather pointless, as we already did it in the submission path.

--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 14:58 ` Johannes Thumshirn @ 2017-01-18 15:14 ` Sagi Grimberg 2017-01-18 15:16 ` Johannes Thumshirn 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-18 15:14 UTC (permalink / raw) To: Johannes Thumshirn Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org > Hannes just spotted this: > static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, > const struct blk_mq_queue_data *bd) > { > [...] > __nvme_submit_cmd(nvmeq, &cmnd); > nvme_process_cq(nvmeq); > spin_unlock_irq(&nvmeq->q_lock); > return BLK_MQ_RQ_QUEUE_OK; > out_cleanup_iod: > nvme_free_iod(dev, req); > out_free_cmd: > nvme_cleanup_cmd(req); > return ret; > } > > So we're draining the CQ on submit. This of cause makes polling for > completions in the IRQ handler rather pointless as we already did in the > submission path. I think you missed: http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 15:14 ` Sagi Grimberg @ 2017-01-18 15:16 ` Johannes Thumshirn 2017-01-18 15:39 ` Hannes Reinecke 0 siblings, 1 reply; 50+ messages in thread From: Johannes Thumshirn @ 2017-01-18 15:16 UTC (permalink / raw) To: Sagi Grimberg Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote: > > >Hannes just spotted this: > >static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, > > const struct blk_mq_queue_data *bd) > >{ > >[...] > > __nvme_submit_cmd(nvmeq, &cmnd); > > nvme_process_cq(nvmeq); > > spin_unlock_irq(&nvmeq->q_lock); > > return BLK_MQ_RQ_QUEUE_OK; > >out_cleanup_iod: > > nvme_free_iod(dev, req); > >out_free_cmd: > > nvme_cleanup_cmd(req); > > return ret; > >} > > > >So we're draining the CQ on submit. This of cause makes polling for > >completions in the IRQ handler rather pointless as we already did in the > >submission path. > > I think you missed: > http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007 I indeed did, thanks. -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N�rnberg GF: Felix Imend�rffer, Jane Smithard, Graham Norton HRB 21284 (AG N�rnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 15:16 ` Johannes Thumshirn @ 2017-01-18 15:39 ` Hannes Reinecke 2017-01-19 8:12 ` Sagi Grimberg 0 siblings, 1 reply; 50+ messages in thread
From: Hannes Reinecke @ 2017-01-18 15:39 UTC (permalink / raw)
To: Johannes Thumshirn, Sagi Grimberg
Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

On 01/18/2017 04:16 PM, Johannes Thumshirn wrote:
> On Wed, Jan 18, 2017 at 05:14:36PM +0200, Sagi Grimberg wrote:
>>
>>> Hannes just spotted this:
>>> static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>> const struct blk_mq_queue_data *bd)
>>> {
>>> [...]
>>> __nvme_submit_cmd(nvmeq, &cmnd);
>>> nvme_process_cq(nvmeq);
>>> spin_unlock_irq(&nvmeq->q_lock);
>>> return BLK_MQ_RQ_QUEUE_OK;
>>> out_cleanup_iod:
>>> nvme_free_iod(dev, req);
>>> out_free_cmd:
>>> nvme_cleanup_cmd(req);
>>> return ret;
>>> }
>>>
>>> So we're draining the CQ on submit. This of cause makes polling for
>>> completions in the IRQ handler rather pointless as we already did in the
>>> submission path.
>>
>> I think you missed:
>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
>
> I indeed did, thanks.
>
But it doesn't help.

We're still having to wait for the first interrupt, and if we're really fast that's the only completion we have to process.

Try this:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b4b32e6..e2dd9e2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	spin_unlock(&nvmeq->sq_lock);
+	disable_irq_nosync(nvmeq_irq(nvmeq));
+	irq_poll_sched(&nvmeq->iop);
 	return BLK_MQ_RQ_QUEUE_OK;
 out_cleanup_iod:
 	nvme_free_iod(dev, req);

That should avoid the first interrupt, and with a bit of luck reduce the number of interrupts _drastically_.

Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@suse.de +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 15:39 ` Hannes Reinecke @ 2017-01-19 8:12 ` Sagi Grimberg 2017-01-19 8:23 ` Sagi Grimberg 2017-01-19 9:13 ` Johannes Thumshirn 0 siblings, 2 replies; 50+ messages in thread
From: Sagi Grimberg @ 2017-01-19 8:12 UTC (permalink / raw)
To: Hannes Reinecke, Johannes Thumshirn
Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

>>> I think you missed:
>>> http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007
>>
>> I indeed did, thanks.
>>
> But it doesn't help.
>
> We're still having to wait for the first interrupt, and if we're really
> fast that's the only completion we have to process.
>
> Try this:
>
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b4b32e6..e2dd9e2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
> }
> __nvme_submit_cmd(nvmeq, &cmnd);
> spin_unlock(&nvmeq->sq_lock);
> + disable_irq_nosync(nvmeq_irq(irq));
> + irq_poll_sched(&nvmeq->iop);

a. This would trigger a condition where we disable the irq twice, which is wrong, at least because it will generate a warning.

b. This would cause way too many triggers of ksoftirqd. In order for it to be effective we need it to run only when it should, and optimally when the completion queue has a batch of completions waiting.

After a deeper analysis, I agree with Bart that interrupt coalescing is needed for it to work. The problem with nvme coalescing, as Jens said, is a death penalty of 100us granularity. Hannes, Johannes, what does it look like with the devices you are testing with?

Also, I think that adaptive moderation is needed in order for it to work well. I know that some networking drivers implemented adaptive moderation in SW before having HW support for it. It can be done by maintaining stats and having a periodic work that looks at them and changes the moderation parameters.

Does anyone think that this is something we should consider?
^ permalink raw reply [flat|nested] 50+ messages in thread
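A software adaptive-moderation scheme along these lines could look roughly like the sketch below: sample completions-per-interrupt in the driver and let a periodic work reprogram the NVMe Interrupt Coalescing feature (Set Features 08h, aggregation threshold in bits 7:0 of dword11, aggregation time in 100us units in bits 15:8). All of the fields and the work hook-up (comp_samples, irq_samples, coal_work) are invented for the sketch and do not exist in struct nvme_dev; the 100us minimum aggregation time is exactly the death penalty mentioned above, which is why the decision logic matters.

--
/*
 * Sketch only, untested: periodic work (process context, so it may issue
 * an admin command) that adjusts interrupt coalescing from observed
 * completions-per-interrupt statistics.
 */
static void nvme_adaptive_coal_work(struct work_struct *work)
{
	struct nvme_dev *dev = container_of(to_delayed_work(work),
					    struct nvme_dev, coal_work);
	u64 comps = atomic64_xchg(&dev->comp_samples, 0);
	u64 irqs = atomic64_xchg(&dev->irq_samples, 0);
	u64 per_irq = irqs ? div64_u64(comps, irqs) : 0;
	u32 thr = 0, time = 0;

	/* only batch when we already see more than one completion per IRQ */
	if (per_irq > 1) {
		thr = min_t(u64, per_irq, 0xff);
		time = 1;	/* one 100us unit - the painful granularity */
	}

	/* Interrupt Coalescing: THR in bits 7:0, TIME in bits 15:8 */
	nvme_set_features(&dev->ctrl, NVME_FEAT_IRQ_COALESCE,
			  (time << 8) | thr, NULL, 0, NULL);

	schedule_delayed_work(&dev->coal_work, HZ);
}
--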
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-19 8:12 ` Sagi Grimberg @ 2017-01-19 8:23 ` Sagi Grimberg 2017-01-19 9:18 ` Johannes Thumshirn 2017-01-19 9:13 ` Johannes Thumshirn 1 sibling, 1 reply; 50+ messages in thread
From: Sagi Grimberg @ 2017-01-19 8:23 UTC (permalink / raw)
To: Hannes Reinecke, Johannes Thumshirn
Cc: Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

Christoph suggested to me once that we can take a hybrid approach where we consume a small amount of completions (say 4) right away from the interrupt handler and if we have more we schedule irq-poll to reap the rest. But back then it didn't work better, which is not aligned with my observations that we consume only 1 completion per interrupt...

I can give it another go... What do people think about it?
^ permalink raw reply [flat|nested] 50+ messages in thread
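For reference, that hybrid approach could be sketched as below, again with made-up helper names (foo_process_cq, foo_cq_has_work) rather than the actual code that was tested: reap a handful of completions directly in hard-irq context and only pay for the softirq when the CQ still has entries pending, so the common one-or-two-completion case never touches irq-poll.

--
#define FOO_IRQ_BUDGET	4

static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_queue *q = data;
	int done;

	/* cheap path: take a handful of completions straight away */
	done = foo_process_cq(q, FOO_IRQ_BUDGET);
	if (done == FOO_IRQ_BUDGET && foo_cq_has_work(q)) {
		/* more waiting: defer the rest to softirq */
		disable_irq_nosync(q->irq);
		irq_poll_sched(&q->iop);
	}
	return done ? IRQ_HANDLED : IRQ_NONE;
}
--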
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-19 8:23 ` Sagi Grimberg @ 2017-01-19 9:18 ` Johannes Thumshirn 0 siblings, 0 replies; 50+ messages in thread
From: Johannes Thumshirn @ 2017-01-19 9:18 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Hannes Reinecke, Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org

On Thu, Jan 19, 2017 at 10:23:28AM +0200, Sagi Grimberg wrote:
> Christoph suggest to me once that we can take a hybrid
> approach where we consume a small amount of completions (say 4)
> right away from the interrupt handler and if we have more
> we schedule irq-poll to reap the rest. But back then it
> didn't work better which is not aligned with my observations
> that we consume only 1 completion per interrupt...
>
> I can give it another go... What do people think about it?

This could be good. What's also possible (see the answer to my previous mail) is measuring the time it takes for a completion to arrive: if the average time is lower than the context switch time, just busy loop instead of waiting for the IRQ to arrive. If it is higher, we can always schedule a timer to hit _before_ the IRQ will likely arrive and start polling.

Is this something that sounds reasonable to you guys as well?

Johannes
--
Johannes Thumshirn Storage
jthumshirn@suse.de +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
^ permalink raw reply [flat|nested] 50+ messages in thread
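A very rough sketch of that idea follows, with invented names and an arbitrary 2us threshold standing in for the measured context-switch cost: keep a running average of the submit-to-completion latency and spin on the CQ for about that long before giving up and letting the interrupt path take over. Something like this would presumably be called from the submission path right after ringing the doorbell; the timer variant mentioned above would replace the spin loop with an hrtimer that fires just before the completion is expected.

--
/* sketch only: avg_lat_ns would be maintained elsewhere as an EWMA */
static bool foo_try_spin(struct foo_queue *q)
{
	u64 avg_ns = READ_ONCE(q->avg_lat_ns);
	ktime_t timeout;

	if (avg_ns > 2000)		/* slower than a wakeup: let the IRQ do it */
		return false;

	timeout = ktime_add_ns(ktime_get(), avg_ns);
	do {
		if (foo_process_cq(q, 1))
			return true;	/* reaped it without an interrupt */
		cpu_relax();
	} while (ktime_before(ktime_get(), timeout));

	return false;			/* give up, the IRQ will fire */
}
--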
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-19 8:12 ` Sagi Grimberg 2017-01-19 8:23 ` Sagi Grimberg @ 2017-01-19 9:13 ` Johannes Thumshirn 1 sibling, 0 replies; 50+ messages in thread From: Johannes Thumshirn @ 2017-01-19 9:13 UTC (permalink / raw) To: Sagi Grimberg Cc: Hannes Reinecke, Jens Axboe, Christoph Hellwig, Linux-scsi, linux-nvme, linux-block, Keith Busch, lsf-pc@lists.linux-foundation.org On Thu, Jan 19, 2017 at 10:12:17AM +0200, Sagi Grimberg wrote: > > >>>I think you missed: > >>>http://git.infradead.org/nvme.git/commit/49c91e3e09dc3c9dd1718df85112a8cce3ab7007 > >> > >>I indeed did, thanks. > >> > >But it doesn't help. > > > >We're still having to wait for the first interrupt, and if we're really > >fast that's the only completion we have to process. > > > >Try this: > > > > > >diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c > >index b4b32e6..e2dd9e2 100644 > >--- a/drivers/nvme/host/pci.c > >+++ b/drivers/nvme/host/pci.c > >@@ -623,6 +623,8 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx, > > } > > __nvme_submit_cmd(nvmeq, &cmnd); > > spin_unlock(&nvmeq->sq_lock); > >+ disable_irq_nosync(nvmeq_irq(irq)); > >+ irq_poll_sched(&nvmeq->iop); > > a. This would trigger a condition where we disable the irq twice, which > is wrong; at the very least it will generate a warning. > > b. This would cause way too many ksoftirqd triggers. In order for > it to be effective we need to run it only when we should, and optimally > when the completion queue has a batch of completions waiting. > > After a deeper analysis, I agree with Bart that interrupt coalescing is > needed for it to work. The problem with nvme coalescing, as Jens said, is > the death penalty of its 100us granularity. Hannes, Johannes, how does it look > with the devices you are testing with? I haven't had a look at AHCI's Command Completion Coalescing yet, but hopefully I'll find the time today (+SSD testing!!!). Don't know if Hannes did (but I _think_ not). The problem is we've already maxed out our test HW w/o irq_poll, so the only change we're currently seeing is an increase in wasted CPU cycles. Not what we wanted to have. > > Also, I think that adaptive moderation is needed in order for it to > work well. I know that some networking drivers implemented adaptive > moderation in SW before having HW support for it. It can be done by > maintaining stats and having a periodic work that looks at them and > changes the moderation parameters. > > Does anyone think that this is something we should consider? Yes, we've been discussing this internally as well and it sounds good, but that's still all pure theory; nothing has actually been implemented and tested. Byte, Johannes -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 15:38 ` Sagi Grimberg 2017-01-17 15:45 ` Sagi Grimberg 2017-01-17 16:15 ` Sagi Grimberg @ 2017-01-17 16:44 ` Andrey Kuzmin 2017-01-17 16:50 ` Sagi Grimberg 2 siblings, 1 reply; 50+ messages in thread From: Andrey Kuzmin @ 2017-01-17 16:44 UTC (permalink / raw) To: Sagi Grimberg Cc: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme@lists.infradead.org, Linux-scsi [-- Attachment #1: Type: text/plain, Size: 6311 bytes --] On Tue, Jan 17, 2017 at 6:38 PM, Sagi Grimberg <sagi@grimberg.me> wrote: > Hey, so I made some initial analysis of whats going on with > irq-poll. > > First, I sampled how much time it takes before we > get the interrupt in nvme_irq and the initial visit > to nvme_irqpoll_handler. I ran a single threaded fio > with QD=32 of 4K reads. This is two displays of a > histogram of the latency (ns): > -- > [1] > queue = b'nvme0q1' > usecs : count distribution > 0 -> 1 : 7310 |****************************************| > 2 -> 3 : 11 | | > 4 -> 7 : 10 | | > 8 -> 15 : 20 | | > 16 -> 31 : 0 | | > 32 -> 63 : 0 | | > 64 -> 127 : 1 | | > > [2] > queue = b'nvme0q1' > usecs : count distribution > 0 -> 1 : 7309 |****************************************| > 2 -> 3 : 14 | | > 4 -> 7 : 7 | | > 8 -> 15 : 17 | | > > We can see that most of the time our latency is pretty good (<1ns) but with > huge tail latencies (some 8-15 ns and even one in 32-63 ns). > **NOTE, in order to reduce the tracing impact on performance I sampled > for every 100 interrupts. > > I also sampled for a multiple threads/queues with QD=32 of 4K reads. > This is a collection of histograms for 5 queues (5 fio threads): > queue = b'nvme0q1' > usecs : count distribution > 0 -> 1 : 701 |****************************************| > 2 -> 3 : 177 |********** | > 4 -> 7 : 56 |*** | > 8 -> 15 : 24 |* | > 16 -> 31 : 6 | | > 32 -> 63 : 1 | | > > queue = b'nvme0q2' > usecs : count distribution > 0 -> 1 : 412 |****************************************| > 2 -> 3 : 52 |***** | > 4 -> 7 : 19 |* | > 8 -> 15 : 13 |* | > 16 -> 31 : 5 | | > > queue = b'nvme0q3' > usecs : count distribution > 0 -> 1 : 381 |****************************************| > 2 -> 3 : 74 |******* | > 4 -> 7 : 26 |** | > 8 -> 15 : 12 |* | > 16 -> 31 : 3 | | > 32 -> 63 : 0 | | > 64 -> 127 : 0 | | > 128 -> 255 : 1 | | > > queue = b'nvme0q4' > usecs : count distribution > 0 -> 1 : 386 |****************************************| > 2 -> 3 : 63 |****** | > 4 -> 7 : 30 |*** | > 8 -> 15 : 11 |* | > 16 -> 31 : 7 | | > 32 -> 63 : 1 | | > > queue = b'nvme0q5' > usecs : count distribution > 0 -> 1 : 384 |****************************************| > 2 -> 3 : 69 |******* | > 4 -> 7 : 25 |** | > 8 -> 15 : 15 |* | > 16 -> 31 : 3 | | > > Overall looks pretty much the same but some more samples with tails... > > Next, I sampled how many completions we are able to consume per interrupt. > Two exaples of histograms of how many completions we take per interrupt. > -- > queue = b'nvme0q1' > completed : count distribution > 0 : 0 | | > 1 : 11690 |****************************************| > 2 : 46 | | > 3 : 1 | | > > queue = b'nvme0q1' > completed : count distribution > 0 : 0 | | > 1 : 944 |****************************************| > 2 : 8 | | > -- > > So it looks like we are super not efficient because most of the times we > catch 1 > completion per interrupt and the whole point is that we need to find more! 
> This fio > is single threaded with QD=32 so I'd expect that we be somewhere in 8-31 > almost all > the time... I also tried QD=1024, histogram is still the same. > It looks like it takes you longer to submit an I/O than to service an interrupt, so increasing queue depth in the single-threaded case doesn't make much difference. You might want to try multiple threads per core with QD, say, 32 (but beware that Intel limits the aggregate queue depth to 256 and even 128 for some models). Regards, Andrey > **NOTE: Here I also sampled for every 100 interrupts. > > > I'll try to run the counter on the current nvme driver and see what I get. > > > > I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps) > with hope that someone is interested enough to try and reproduce these > numbers on his/her setup and maybe suggest some other useful tracing > we can do. > > Prerequisites: > 1. iovisor is needed for python bpf support. > $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial > xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list > $ sudo apt-get update -y > $ sudo apt-get install bcc-tools -y > # Nasty hack .. bcc is only available in python2 but compliant with > python3.. > $ sudo cp -r /usr/lib/python2.7/dist-packages/bcc > /usr/lib/python3/dist-packages/ > > 2. Because we don't have the nvme-pci symbols exported, the nvme.h file is > needed on the > test machine (where the bpf code is running). I used an nfs mount for the > linux source (this > is why I include from /mnt/linux in the scripts). > > > _______________________________________________ > Linux-nvme mailing list > Linux-nvme@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-nvme > > [-- Attachment #2: Type: text/html, Size: 8770 bytes --] ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 16:44 ` Andrey Kuzmin @ 2017-01-17 16:50 ` Sagi Grimberg 2017-01-18 14:02 ` Hannes Reinecke 0 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-17 16:50 UTC (permalink / raw) To: Andrey Kuzmin Cc: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme@lists.infradead.org, Linux-scsi > So it looks like we are super not efficient because most of the > times we catch 1 > completion per interrupt and the whole point is that we need to find > more! This fio > is single threaded with QD=32 so I'd expect that we be somewhere in > 8-31 almost all > the time... I also tried QD=1024, histogram is still the same. > > > It looks like it takes you longer to submit an I/O than to service an > interrupt, Well, with irq-poll we do practically nothing in the interrupt handler, only schedule irq-poll. Note that the latency measures are only from the point the interrupt arrives and the point we actually service it by polling for completions. > so increasing queue depth in the singe-threaded case doesn't > make much difference. You might want to try multiple threads per core > with QD, say, 32 This is how I ran, QD=32. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-17 16:50 ` Sagi Grimberg @ 2017-01-18 14:02 ` Hannes Reinecke 2017-01-20 0:13 ` Jens Axboe 0 siblings, 1 reply; 50+ messages in thread From: Hannes Reinecke @ 2017-01-18 14:02 UTC (permalink / raw) To: Sagi Grimberg, Andrey Kuzmin Cc: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme@lists.infradead.org, Linux-scsi On 01/17/2017 05:50 PM, Sagi Grimberg wrote: > >> So it looks like we are super not efficient because most of the >> times we catch 1 >> completion per interrupt and the whole point is that we need to find >> more! This fio >> is single threaded with QD=32 so I'd expect that we be somewhere in >> 8-31 almost all >> the time... I also tried QD=1024, histogram is still the same. >> >> >> It looks like it takes you longer to submit an I/O than to service an >> interrupt, > > Well, with irq-poll we do practically nothing in the interrupt handler, > only schedule irq-poll. Note that the latency measures are only from > the point the interrupt arrives and the point we actually service it > by polling for completions. > >> so increasing queue depth in the singe-threaded case doesn't >> make much difference. You might want to try multiple threads per core >> with QD, say, 32 > > This is how I ran, QD=32. The one thing which I found _really_ curious is this: IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=256 (note the lines starting with 'submit' and 'complete'). They are _always_ 4, irrespective of the hardware and/or tests which I run. Jens, what are these numbers supposed to mean? Is this intended? ATM the information content from those two lines is essentially 0, seeing that the never change irrespective of the tests I'm doing. (And which fio version I'm using ...) Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-18 14:02 ` Hannes Reinecke @ 2017-01-20 0:13 ` Jens Axboe 0 siblings, 0 replies; 50+ messages in thread From: Jens Axboe @ 2017-01-20 0:13 UTC (permalink / raw) To: Hannes Reinecke, Sagi Grimberg, Andrey Kuzmin Cc: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, linux-block, Christoph Hellwig, Keith Busch, linux-nvme@lists.infradead.org, Linux-scsi On 01/18/2017 06:02 AM, Hannes Reinecke wrote: > On 01/17/2017 05:50 PM, Sagi Grimberg wrote: >> >>> So it looks like we are super not efficient because most of the >>> times we catch 1 >>> completion per interrupt and the whole point is that we need to find >>> more! This fio >>> is single threaded with QD=32 so I'd expect that we be somewhere in >>> 8-31 almost all >>> the time... I also tried QD=1024, histogram is still the same. >>> >>> >>> It looks like it takes you longer to submit an I/O than to service an >>> interrupt, >> >> Well, with irq-poll we do practically nothing in the interrupt handler, >> only schedule irq-poll. Note that the latency measures are only from >> the point the interrupt arrives and the point we actually service it >> by polling for completions. >> >>> so increasing queue depth in the singe-threaded case doesn't >>> make much difference. You might want to try multiple threads per core >>> with QD, say, 32 >> >> This is how I ran, QD=32. > > The one thing which I found _really_ curious is this: > > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >> =64=100.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> =64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >> =64=0.1% > issued : total=r=7673377/w=0/d=0, short=r=0/w=0/d=0, > drop=r=0/w=0/d=0 > latency : target=0, window=0, percentile=100.00%, depth=256 > > (note the lines starting with 'submit' and 'complete'). > They are _always_ 4, irrespective of the hardware and/or tests which I > run. Jens, what are these numbers supposed to mean? > Is this intended? It's bucketized. 0=0.0% means that 0% of the submissions didn't submit anything (unsurprisingly), and ditto for the complete side. The next bucket is 1..4, so 100% of submissions and completions was in that range. -- Jens Axboe ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 15:13 ` Jens Axboe 2017-01-12 8:23 ` Sagi Grimberg @ 2017-01-13 15:56 ` Johannes Thumshirn 1 sibling, 0 replies; 50+ messages in thread From: Johannes Thumshirn @ 2017-01-13 15:56 UTC (permalink / raw) To: Jens Axboe Cc: lsf-pc@lists.linux-foundation.org, linux-block, Linux-scsi, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch On Wed, Jan 11, 2017 at 08:13:02AM -0700, Jens Axboe wrote: > On 01/11/2017 08:07 AM, Jens Axboe wrote: > > On 01/11/2017 06:43 AM, Johannes Thumshirn wrote: > >> Hi all, > >> > >> I'd like to attend LSF/MM and would like to discuss polling for block drivers. > >> > >> Currently there is blk-iopoll but it is neither as widely used as NAPI in the > >> networking field and accoring to Sagi's findings in [1] performance with > >> polling is not on par with IRQ usage. > >> > >> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in > >> more block drivers and how to overcome the currently seen performance issues. > > > > It would be an interesting topic to discuss, as it is a shame that blk-iopoll > > isn't used more widely. > > Forgot to mention - it should only be a topic, if experimentation has > been done and results gathered to pin point what the issues are, so we > have something concrete to discus. I'm not at all interested in a hand > wavy discussion on the topic. So here are my first real numbers on this topic w/ some spinning rust: everything is done with 4.10-rc3, and we at least see no performance degradation with a poll budget of 128 or 256 (oddly, the maximum irq_poll currently allows you to have). Clearly the disk is the limiting factor here and we have already saturated it. I'll do AHCI SSD tests on Monday. Hannes did some tests with mptXsas and an SSD; maybe he can share his findings as well. scsi-sq: -------- baseline: read : io=66776KB, bw=1105.5KB/s, iops=276, runt= 60406msec write: io=65812KB, bw=1089.6KB/s, iops=272, runt= 60406msec AHCI irq_poll budget 31: read : io=53372KB, bw=904685B/s, iops=220, runt= 60411msec write: io=52596KB, bw=891531B/s, iops=217, runt= 60411msec AHCI irq_poll budget 128: read : io=66664KB, bw=1106.4KB/s, iops=276, runt= 60257msec write: io=65608KB, bw=1088.9KB/s, iops=272, runt= 60257msec AHCI irq_poll budget 256: read : io=67048KB, bw=1111.2KB/s, iops=277, runt= 60296msec write: io=65916KB, bw=1093.3KB/s, iops=273, runt= 60296msec scsi-mq: -------- baseline: read : io=78220KB, bw=1300.7KB/s, iops=325, runt= 60140msec write: io=77104KB, bw=1282.8KB/s, iops=320, runt= 60140msec AHCI irq_poll budget 256: read : io=78316KB, bw=1301.7KB/s, iops=325, runt= 60167msec write: io=77172KB, bw=1282.7KB/s, iops=320, runt= 60167msec -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 15:07 ` Jens Axboe 2017-01-11 15:13 ` Jens Axboe @ 2017-01-11 15:16 ` Hannes Reinecke 2017-01-12 4:36 ` Stephen Bates 2017-01-19 10:57 ` Ming Lei 3 siblings, 0 replies; 50+ messages in thread From: Hannes Reinecke @ 2017-01-11 15:16 UTC (permalink / raw) To: Jens Axboe, Johannes Thumshirn, lsf-pc@lists.linux-foundation.org Cc: linux-block, Linux-scsi, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch On 01/11/2017 04:07 PM, Jens Axboe wrote: > On 01/11/2017 06:43 AM, Johannes Thumshirn wrote: >> Hi all, >> >> I'd like to attend LSF/MM and would like to discuss polling for block drivers. >> >> Currently there is blk-iopoll but it is neither as widely used as NAPI in the >> networking field and accoring to Sagi's findings in [1] performance with >> polling is not on par with IRQ usage. >> >> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in >> more block drivers and how to overcome the currently seen performance issues. > > It would be an interesting topic to discuss, as it is a shame that blk-iopoll > isn't used more widely. > Indeed; some drivers like lpfc already _have_ a polling mode, but not hooked up to blk-iopoll. Would be really cool to get that going. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 15:07 ` Jens Axboe 2017-01-11 15:13 ` Jens Axboe 2017-01-11 15:16 ` Hannes Reinecke @ 2017-01-12 4:36 ` Stephen Bates 2017-01-12 4:44 ` Jens Axboe 2017-01-19 10:57 ` Ming Lei 3 siblings, 1 reply; 50+ messages in thread From: Stephen Bates @ 2017-01-12 4:36 UTC (permalink / raw) To: Jens Axboe Cc: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, Christoph Hellwig, Sagi Grimberg, linux-scsi, linux-nvme, linux-block, Keith Busch >> >> I'd like to attend LSF/MM and would like to discuss polling for block >> drivers. >> >> Currently there is blk-iopoll but it is neither as widely used as NAPI >> in the networking field and accoring to Sagi's findings in [1] >> performance with polling is not on par with IRQ usage. >> >> On LSF/MM I'd like to whether it is desirable to have NAPI like polling >> in more block drivers and how to overcome the currently seen performance >> issues. > > It would be an interesting topic to discuss, as it is a shame that > blk-iopoll isn't used more widely. > > -- > Jens Axboe > I'd also be interested in this topic. Given that iopoll only really makes sense for low-latency, low queue depth environments (i.e. down below 10-20us) I'd like to discuss which drivers we think will need/want to be upgraded (aside from NVMe ;-)). I'd also be interested in discussing how best to enable and disable polling. In the past some of us have pushed for a "big hammer" to turn polling on for a given device or HW queue [1]. I'd like to discuss this again as well as looking at other methods above and beyond the preadv2 system call and the HIPRI flag. Stephen [1] http://marc.info/?l=linux-block&m=146307410101827&w=2 > _______________________________________________ > Linux-nvme mailing list > Linux-nvme@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-nvme > > ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 4:36 ` Stephen Bates @ 2017-01-12 4:44 ` Jens Axboe 2017-01-12 4:56 ` Stephen Bates 0 siblings, 1 reply; 50+ messages in thread From: Jens Axboe @ 2017-01-12 4:44 UTC (permalink / raw) To: Stephen Bates Cc: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, Christoph Hellwig, Sagi Grimberg, linux-scsi, linux-nvme, linux-block, Keith Busch On 01/11/2017 09:36 PM, Stephen Bates wrote: >>> >>> I'd like to attend LSF/MM and would like to discuss polling for block >>> drivers. >>> >>> Currently there is blk-iopoll but it is neither as widely used as NAPI >>> in the networking field and accoring to Sagi's findings in [1] >>> performance with polling is not on par with IRQ usage. >>> >>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling >>> in more block drivers and how to overcome the currently seen performance >>> issues. >> >> It would be an interesting topic to discuss, as it is a shame that >> blk-iopoll isn't used more widely. >> >> -- >> Jens Axboe >> > > I'd also be interested in this topic. Given that iopoll only really makes > sense for low-latency, low queue depth environments (i.e. down below > 10-20us) I'd like to discuss which drivers we think will need/want to be > upgraded (aside from NVMe ;-)). > > I'd also be interested in discussing how best to enable and disable > polling. In the past some of us have pushed for a "big hammer" to turn > polling on for a given device or HW queue [1]. I'd like to discuss this > again as well as looking at other methods above and beyond the preadv2 > system call and the HIPRI flag. This is a separate topic. The initial proposal is for polling for interrupt mitigation, you are talking about polling in the context of polling for completion of an IO. We can definitely talk about this form of polling as well, but it should be a separate topic and probably proposed independently. -- Jens Axboe ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 4:44 ` Jens Axboe @ 2017-01-12 4:56 ` Stephen Bates 0 siblings, 0 replies; 50+ messages in thread From: Stephen Bates @ 2017-01-12 4:56 UTC (permalink / raw) To: Jens Axboe Cc: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, Christoph Hellwig, Sagi Grimberg, linux-scsi, linux-nvme, linux-block, Keith Busch > > This is a separate topic. The initial proposal is for polling for > interrupt mitigation, you are talking about polling in the context of > polling for completion of an IO. > > We can definitely talk about this form of polling as well, but it should > be a separate topic and probably proposed independently. > > -- > Jens Axboe > > Jens Oh thanks for the clarification. I will propose this as a separate topic. Thanks Stephen ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 15:07 ` Jens Axboe ` (2 preceding siblings ...) 2017-01-12 4:36 ` Stephen Bates @ 2017-01-19 10:57 ` Ming Lei 2017-01-19 11:03 ` Hannes Reinecke 3 siblings, 1 reply; 50+ messages in thread From: Ming Lei @ 2017-01-19 10:57 UTC (permalink / raw) To: Jens Axboe Cc: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, linux-block, Linux SCSI List, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch On Wed, Jan 11, 2017 at 11:07 PM, Jens Axboe <axboe@kernel.dk> wrote: > On 01/11/2017 06:43 AM, Johannes Thumshirn wrote: >> Hi all, >> >> I'd like to attend LSF/MM and would like to discuss polling for block drivers. >> >> Currently there is blk-iopoll but it is neither as widely used as NAPI in the >> networking field and accoring to Sagi's findings in [1] performance with >> polling is not on par with IRQ usage. >> >> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in >> more block drivers and how to overcome the currently seen performance issues. > > It would be an interesting topic to discuss, as it is a shame that blk-iopoll > isn't used more widely. I remembered that Keith and I discussed some issues of blk-iopoll: http://marc.info/?l=linux-block&m=147576999016407&w=2 seems which isn't addressed yet. Thanks, Ming Lei ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-19 10:57 ` Ming Lei @ 2017-01-19 11:03 ` Hannes Reinecke 0 siblings, 0 replies; 50+ messages in thread From: Hannes Reinecke @ 2017-01-19 11:03 UTC (permalink / raw) To: Ming Lei, Jens Axboe Cc: Johannes Thumshirn, lsf-pc@lists.linux-foundation.org, linux-block, Linux SCSI List, Sagi Grimberg, linux-nvme, Christoph Hellwig, Keith Busch On 01/19/2017 11:57 AM, Ming Lei wrote: > On Wed, Jan 11, 2017 at 11:07 PM, Jens Axboe <axboe@kernel.dk> wrote: >> On 01/11/2017 06:43 AM, Johannes Thumshirn wrote: >>> Hi all, >>> >>> I'd like to attend LSF/MM and would like to discuss polling for block drivers. >>> >>> Currently there is blk-iopoll but it is neither as widely used as NAPI in the >>> networking field and accoring to Sagi's findings in [1] performance with >>> polling is not on par with IRQ usage. >>> >>> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in >>> more block drivers and how to overcome the currently seen performance issues. >> >> It would be an interesting topic to discuss, as it is a shame that blk-iopoll >> isn't used more widely. > > I remembered that Keith and I discussed some issues of blk-iopoll: > > http://marc.info/?l=linux-block&m=147576999016407&w=2 > > seems which isn't addressed yet. > That's a different poll. For some obscure reasons you have a blk-mq-poll function (via q->mq_ops->poll) and an irqpoll function. The former is for polling completion of individual block-layer tags, the latter for polling completions from the hardware instead of relying on interrupts. We're discussing the latter here, so that thread isn't really applicable here. However, there have been requests to discuss the former at LSF/MM, too. So there might be a chance of restarting that discussion. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 50+ messages in thread
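For readers skimming the archive, the distinction Hannes draws, in code terms: the blk-mq hook spins for one specific tag on behalf of a caller, while irq_poll (sketched after Sagi's summary further down) reaps whatever sits in the completion queue from softirq context as an interrupt mitigation mechanism. The foo_* names are placeholders and the ->poll signature is abbreviated from memory; it shifts between kernel versions.

#include <linux/blk-mq.h>

struct foo_queue;						/* placeholder */
static int foo_reap_completions_for_tag(struct foo_queue *q,
					unsigned int tag);	/* placeholder */

/* blk-mq completion polling: a caller spins for one specific tag */
static int foo_mq_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
{
	struct foo_queue *q = hctx->driver_data;

	/* scan the CQ; return non-zero if the tag's completion was found */
	return foo_reap_completions_for_tag(q, tag);
}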
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 13:43 [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers Johannes Thumshirn 2017-01-11 13:46 ` Hannes Reinecke 2017-01-11 15:07 ` Jens Axboe @ 2017-01-11 16:08 ` Bart Van Assche 2017-01-11 16:12 ` hch ` (2 more replies) 2 siblings, 3 replies; 50+ messages in thread From: Bart Van Assche @ 2017-01-11 16:08 UTC (permalink / raw) To: jthumshirn@suse.de, lsf-pc@lists.linux-foundation.org Cc: Linux-scsi@vger.kernel.org, hch@infradead.org, keith.busch@intel.com, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, sagi@grimberg.me On Wed, 2017-01-11 at 14:43 +0100, Johannes Thumshirn wrote: > I'd like to attend LSF/MM and would like to discuss polling for block > drivers. > > Currently there is blk-iopoll but it is neither as widely used as NAPI in > the networking field and accoring to Sagi's findings in [1] performance > with polling is not on par with IRQ usage. > > On LSF/MM I'd like to whether it is desirable to have NAPI like polling in > more block drivers and how to overcome the currently seen performance > issues. > > [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html A typical Ethernet network adapter delays the generation of an interrupt after it has received a packet. A typical block device or HBA does not delay the generation of an interrupt that reports an I/O completion. I think that is why polling is more effective for network adapters than for block devices. I'm not sure whether it is possible to achieve benefits similar to NAPI for block devices without implementing interrupt coalescing in the block device firmware. Note: for block device implementations that use the RDMA API, the RDMA API supports interrupt coalescing (see also ib_modify_cq()). An example of the interrupt coalescing parameters for a network adapter: # ethtool -c em1 | grep -E 'rx-usecs:|tx-usecs:' rx-usecs: 3 tx-usecs: 0 Bart. ^ permalink raw reply [flat|nested] 50+ messages in thread
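As a point of reference for the ib_modify_cq() remark: a minimal sketch of what setting CQ moderation looks like on the RDMA side. The 16-completion/4-microsecond numbers are arbitrary examples, and whether (and in which units) the provider honours them depends on the underlying hardware driver.

#include <rdma/ib_verbs.h>

static int foo_set_cq_moderation(struct ib_cq *cq)
{
	/* raise a completion event after 16 CQEs or ~4us, whichever comes first */
	return ib_modify_cq(cq, 16, 4);
}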
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:08 ` Bart Van Assche @ 2017-01-11 16:12 ` hch 2017-01-11 16:15 ` Jens Axboe 2017-01-11 16:22 ` Hannes Reinecke 2017-01-11 16:14 ` Johannes Thumshirn 2017-01-12 8:41 ` Sagi Grimberg 2 siblings, 2 replies; 50+ messages in thread From: hch @ 2017-01-11 16:12 UTC (permalink / raw) To: Bart Van Assche Cc: jthumshirn@suse.de, lsf-pc@lists.linux-foundation.org, Linux-scsi@vger.kernel.org, hch@infradead.org, keith.busch@intel.com, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, sagi@grimberg.me On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote: > A typical Ethernet network adapter delays the generation of an interrupt > after it has received a packet. A typical block device or HBA does not delay > the generation of an interrupt that reports an I/O completion. NVMe allows for configurable interrupt coalescing, as do a few modern SCSI HBAs. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:12 ` hch @ 2017-01-11 16:15 ` Jens Axboe 2017-01-11 16:22 ` Hannes Reinecke 1 sibling, 0 replies; 50+ messages in thread From: Jens Axboe @ 2017-01-11 16:15 UTC (permalink / raw) To: hch@infradead.org, Bart Van Assche Cc: jthumshirn@suse.de, lsf-pc@lists.linux-foundation.org, Linux-scsi@vger.kernel.org, keith.busch@intel.com, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, sagi@grimberg.me On 01/11/2017 09:12 AM, hch@infradead.org wrote: > On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote: >> A typical Ethernet network adapter delays the generation of an interrupt >> after it has received a packet. A typical block device or HBA does not delay >> the generation of an interrupt that reports an I/O completion. > > NVMe allows for configurable interrupt coalescing, as do a few modern > SCSI HBAs. Unfortunately it's way too coarse on NVMe, with the timer being in 100 usec increments... I've had mixed success with the depth trigger. -- Jens Axboe ^ permalink raw reply [flat|nested] 50+ messages in thread
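For context on the 100 usec complaint: NVMe's Interrupt Coalescing feature (NVME_FEAT_IRQ_COALESCE) packs an aggregation threshold and an aggregation time into a single dword, and the time field is specified in 100 microsecond increments, so there is nothing finer-grained to ask the device for. A schematic helper, not actual driver code, just to show the layout:

#include <linux/nvme.h>

/*
 * Value for cdw11 of a Set Features (NVME_FEAT_IRQ_COALESCE) command:
 * bits 7:0 = aggregation threshold (completions), bits 15:8 = aggregation
 * time in 100us units -- hence the coarse granularity mentioned above.
 */
static u32 foo_irq_coalesce_dword11(u8 threshold, u8 time_100us)
{
	return threshold | ((u32)time_100us << 8);
}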
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:12 ` hch @ 2017-01-11 16:22 ` Hannes Reinecke 2017-01-11 16:26 ` Bart Van Assche 1 sibling, 1 reply; 50+ messages in thread From: Hannes Reinecke @ 2017-01-11 16:22 UTC (permalink / raw) To: hch@infradead.org, Bart Van Assche Cc: jthumshirn@suse.de, lsf-pc@lists.linux-foundation.org, Linux-scsi@vger.kernel.org, keith.busch@intel.com, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, sagi@grimberg.me On 01/11/2017 05:12 PM, hch@infradead.org wrote: > On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote: >> A typical Ethernet network adapter delays the generation of an interrupt >> after it has received a packet. A typical block device or HBA does not delay >> the generation of an interrupt that reports an I/O completion. > > NVMe allows for configurable interrupt coalescing, as do a few modern > SCSI HBAs. Essentially every modern SCSI HBA does interrupt coalescing; otherwise the queuing interface won't work efficiently. Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@suse.de +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:22 ` Hannes Reinecke @ 2017-01-11 16:26 ` Bart Van Assche 2017-01-11 16:45 ` Hannes Reinecke 2017-01-12 8:52 ` sagi grimberg 0 siblings, 2 replies; 50+ messages in thread From: Bart Van Assche @ 2017-01-11 16:26 UTC (permalink / raw) To: hch@infradead.org, hare@suse.de Cc: Linux-scsi@vger.kernel.org, keith.busch@intel.com, jthumshirn@suse.de, linux-nvme@lists.infradead.org, lsf-pc@lists.linux-foundation.org, linux-block@vger.kernel.org, sagi@grimberg.me On Wed, 2017-01-11 at 17:22 +0100, Hannes Reinecke wrote: > On 01/11/2017 05:12 PM, hch@infradead.org wrote: > > On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote: > > > A typical Ethernet network adapter delays the generation of an > > > interrupt > > > after it has received a packet. A typical block device or HBA does not > > > delay > > > the generation of an interrupt that reports an I/O completion. > > > > NVMe allows for configurable interrupt coalescing, as do a few modern > > SCSI HBAs. > > Essentially every modern SCSI HBA does interrupt coalescing; otherwise > the queuing interface won't work efficiently. Hello Hannes, The first e-mail in this e-mail thread referred to measurements against a block device for which interrupt coalescing was not enabled. I think that the measurements have to be repeated against a block device for which interrupt coalescing is enabled. Bart. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:26 ` Bart Van Assche @ 2017-01-11 16:45 ` Hannes Reinecke 2017-01-12 8:52 ` sagi grimberg 1 sibling, 0 replies; 50+ messages in thread From: Hannes Reinecke @ 2017-01-11 16:45 UTC (permalink / raw) To: Bart Van Assche, hch@infradead.org Cc: Linux-scsi@vger.kernel.org, keith.busch@intel.com, jthumshirn@suse.de, linux-nvme@lists.infradead.org, lsf-pc@lists.linux-foundation.org, linux-block@vger.kernel.org, sagi@grimberg.me On 01/11/2017 05:26 PM, Bart Van Assche wrote: > On Wed, 2017-01-11 at 17:22 +0100, Hannes Reinecke wrote: >> On 01/11/2017 05:12 PM, hch@infradead.org wrote: >>> On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote: >>>> A typical Ethernet network adapter delays the generation of an >>>> interrupt >>>> after it has received a packet. A typical block device or HBA does not >>>> delay >>>> the generation of an interrupt that reports an I/O completion. >>> >>> NVMe allows for configurable interrupt coalescing, as do a few modern >>> SCSI HBAs. >> >> Essentially every modern SCSI HBA does interrupt coalescing; otherwise >> the queuing interface won't work efficiently. > > Hello Hannes, > > The first e-mail in this e-mail thread referred to measurements against a > block device for which interrupt coalescing was not enabled. I think that > the measurements have to be repeated against a block device for which > interrupt coalescing is enabled. > Guess what we'll be doing in the next few days ... Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:26 ` Bart Van Assche 2017-01-11 16:45 ` Hannes Reinecke @ 2017-01-12 8:52 ` sagi grimberg 1 sibling, 0 replies; 50+ messages in thread From: sagi grimberg @ 2017-01-12 8:52 UTC (permalink / raw) To: Bart Van Assche, hch@infradead.org, hare@suse.de Cc: Linux-scsi@vger.kernel.org, keith.busch@intel.com, jthumshirn@suse.de, linux-nvme@lists.infradead.org, lsf-pc@lists.linux-foundation.org, linux-block@vger.kernel.org >>>> A typical Ethernet network adapter delays the generation of an >>>> interrupt >>>> after it has received a packet. A typical block device or HBA does not >>>> delay >>>> the generation of an interrupt that reports an I/O completion. >>> >>> NVMe allows for configurable interrupt coalescing, as do a few modern >>> SCSI HBAs. >> >> Essentially every modern SCSI HBA does interrupt coalescing; otherwise >> the queuing interface won't work efficiently. > > Hello Hannes, > > The first e-mail in this e-mail thread referred to measurements against a > block device for which interrupt coalescing was not enabled. I think that > the measurements have to be repeated against a block device for which > interrupt coalescing is enabled. Hey Bart, I see how interrupt coalescing can help, but even without it, I think it should be better. Moreover, I don't think that strict moderation is something that can work. The only way interrupt moderation can be effective, is if it's adaptive and adjusts itself to the workload. Note that this feature is on by default in most of the modern Ethernet devices (adaptive-rx). IMHO, irq-poll vs. interrupt polling should be compared without relying on the underlying device capabilities. ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:08 ` Bart Van Assche 2017-01-11 16:12 ` hch @ 2017-01-11 16:14 ` Johannes Thumshirn 2017-01-12 8:41 ` Sagi Grimberg 2 siblings, 0 replies; 50+ messages in thread From: Johannes Thumshirn @ 2017-01-11 16:14 UTC (permalink / raw) To: Bart Van Assche Cc: lsf-pc@lists.linux-foundation.org, Linux-scsi@vger.kernel.org, hch@infradead.org, keith.busch@intel.com, linux-nvme@lists.infradead.org, linux-block@vger.kernel.org, sagi@grimberg.me On Wed, Jan 11, 2017 at 04:08:31PM +0000, Bart Van Assche wrote: [...] > A typical Ethernet network adapter delays the generation of an interrupt > after it has received a packet. A typical block device or HBA does not delay > the generation of an interrupt that reports an I/O completion. I think that > is why polling is more effective for network adapters than for block > devices. I'm not sure whether it is possible to achieve benefits similar to > NAPI for block devices without implementing interrupt coalescing in the > block device firmware. Note: for block device implementations that use the > RDMA API, the RDMA API supports interrupt coalescing (see also > ib_modify_cq()). Well, you can always turn off IRQ generation in the HBA just before scheduling the poll handler and re-enable it after you've exhausted your budget or used too much time, can't you? I'll do some prototyping and tests tomorrow so we have some more ground for discussion. Byte, Johannes -- Johannes Thumshirn Storage jthumshirn@suse.de +49 911 74053 689 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850 ^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-11 16:08 ` Bart Van Assche 2017-01-11 16:12 ` hch 2017-01-11 16:14 ` Johannes Thumshirn @ 2017-01-12 8:41 ` Sagi Grimberg 2017-01-12 19:13 ` Bart Van Assche 2 siblings, 1 reply; 50+ messages in thread From: Sagi Grimberg @ 2017-01-12 8:41 UTC (permalink / raw) To: Bart Van Assche, jthumshirn@suse.de, lsf-pc@lists.linux-foundation.org Cc: hch@infradead.org, keith.busch@intel.com, linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, Linux-scsi@vger.kernel.org >> I'd like to attend LSF/MM and would like to discuss polling for block >> drivers. >> >> Currently there is blk-iopoll but it is neither as widely used as NAPI in >> the networking field and accoring to Sagi's findings in [1] performance >> with polling is not on par with IRQ usage. >> >> On LSF/MM I'd like to whether it is desirable to have NAPI like polling in >> more block drivers and how to overcome the currently seen performance >> issues. >> >> [1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.html > > A typical Ethernet network adapter delays the generation of an interrupt > after it has received a packet. A typical block device or HBA does not delay > the generation of an interrupt that reports an I/O completion. I think that > is why polling is more effective for network adapters than for block > devices. I'm not sure whether it is possible to achieve benefits similar to > NAPI for block devices without implementing interrupt coalescing in the > block device firmware. Note: for block device implementations that use the > RDMA API, the RDMA API supports interrupt coalescing (see also > ib_modify_cq()). Hey Bart, I don't agree that missing interrupt coalescing is what makes irq-poll unsuitable for nvme or storage devices. First, when the nvme device fires an interrupt, the driver consumes the completion(s) from the interrupt (usually there will be some more completions waiting in the cq by the time the host starts processing it). With irq-poll, we disable further interrupts and schedule soft-irq for processing, which, if anything, improves the completions-per-interrupt utilization (because it takes slightly longer before we process the cq). Moreover, irq-poll budgets the completion queue processing, which is important for a couple of reasons. 1. It prevents the hard-irq context abuse we have today: if other cpu cores keep pounding the same queue with more submissions, we might get into a hard-lockup (which I've seen happen). 2. irq-poll maintains fairness between devices by correctly budgeting the processing of different completion queues that share the same affinity. This can become crucial when working with multiple nvme devices, each with multiple io queues that share the same IRQ assignment. 3. It reduces (or at least should reduce) the overall number of interrupts in the system because we only enable interrupts again when the completion queue is completely processed. So overall, I think it's very useful for nvme and other modern HBAs, but unfortunately, other than solving (1), I wasn't able to see a performance improvement but rather a slight regression, and I can't explain where it's coming from... _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
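Pulling the pieces of the thread together, a minimal sketch of the wiring Sagi describes, using the irq_poll API from lib/irq_poll.c; the device-side masking helpers and foo_reap_completions() are placeholders and the budget value is arbitrary.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define FOO_POLL_BUDGET	64

struct foo_queue {
	struct irq_poll iop;
	/* ... */
};

static void foo_mask_irq(struct foo_queue *q);			/* placeholder */
static void foo_unmask_irq(struct foo_queue *q);		/* placeholder */
static int foo_reap_completions(struct foo_queue *q, int max);	/* placeholder */

static irqreturn_t foo_irq(int irq, void *data)
{
	struct foo_queue *q = data;

	/* hard-irq does almost nothing: mask the source and defer to softirq */
	foo_mask_irq(q);
	irq_poll_sched(&q->iop);
	return IRQ_HANDLED;
}

static int foo_irqpoll(struct irq_poll *iop, int budget)
{
	struct foo_queue *q = container_of(iop, struct foo_queue, iop);
	int done = foo_reap_completions(q, budget);

	if (done < budget) {
		/* drained: stop polling, let interrupts fire again (point 3) */
		irq_poll_complete(iop);
		foo_unmask_irq(q);
	}
	/* returning the count lets irq_poll enforce the budget and the
	 * round-robin fairness between queues sharing this CPU (point 2) */
	return done;
}

static void foo_queue_init(struct foo_queue *q)
{
	irq_poll_init(&q->iop, FOO_POLL_BUDGET, foo_irqpoll);
}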
* Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers 2017-01-12 8:41 ` Sagi Grimberg @ 2017-01-12 19:13 ` Bart Van Assche 0 siblings, 0 replies; 50+ messages in thread From: Bart Van Assche @ 2017-01-12 19:13 UTC (permalink / raw) To: jthumshirn@suse.de, lsf-pc@lists.linux-foundation.org, sagi@grimberg.me Cc: hch@infradead.org, keith.busch@intel.com, linux-block@vger.kernel.org, linux-nvme@lists.infradead.org, Linux-scsi@vger.kernel.org On Thu, 2017-01-12 at 10:41 +0200, Sagi Grimberg wrote: > First, when the nvme device fires an interrupt, the driver consumes > the completion(s) from the interrupt (usually there will be some more > completions waiting in the cq by the time the host start processing it). > With irq-poll, we disable further interrupts and schedule soft-irq for > processing, which if at all, improve the completions per interrupt > utilization (because it takes slightly longer before processing the cq). > > Moreover, irq-poll is budgeting the completion queue processing which is > important for a couple of reasons. > > 1. it prevents hard-irq context abuse like we do today. if other cpu > cores are pounding with more submissions on the same queue, we might > get into a hard-lockup (which I've seen happening). > > 2. irq-poll maintains fairness between devices by correctly budgeting > the processing of different completions queues that share the same > affinity. This can become crucial when working with multiple nvme > devices, each has multiple io queues that share the same IRQ > assignment. > > 3. It reduces (or at least should reduce) the overall number of > interrupts in the system because we only enable interrupts again > when the completion queue is completely processed. > > So overall, I think it's very useful for nvme and other modern HBAs, > but unfortunately, other than solving (1), I wasn't able to see > performance improvement but rather a slight regression, but I can't > explain where its coming from... Hello Sagi, Thank you for the additional clarification. Although I am not sure whether irq-poll is the ideal solution for the problems that has been described above, I agree that it would help to discuss this topic further during LSF/MM. Bart. _______________________________________________ Linux-nvme mailing list Linux-nvme@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-nvme ^ permalink raw reply [flat|nested] 50+ messages in thread
end of thread, other threads:[~2017-01-20 12:22 UTC | newest] Thread overview: 50+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-01-11 13:43 [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers Johannes Thumshirn 2017-01-11 13:46 ` Hannes Reinecke 2017-01-11 15:07 ` Jens Axboe 2017-01-11 15:13 ` Jens Axboe 2017-01-12 8:23 ` Sagi Grimberg 2017-01-12 10:02 ` Johannes Thumshirn 2017-01-12 11:44 ` Sagi Grimberg 2017-01-12 12:53 ` Johannes Thumshirn 2017-01-12 14:41 ` [Lsf-pc] " Sagi Grimberg 2017-01-12 18:59 ` Johannes Thumshirn 2017-01-17 15:38 ` Sagi Grimberg 2017-01-17 15:45 ` Sagi Grimberg 2017-01-20 12:22 ` Johannes Thumshirn 2017-01-17 16:15 ` Sagi Grimberg 2017-01-17 16:27 ` Johannes Thumshirn 2017-01-17 16:38 ` Sagi Grimberg 2017-01-18 13:51 ` Johannes Thumshirn 2017-01-18 14:27 ` Sagi Grimberg 2017-01-18 14:36 ` Andrey Kuzmin 2017-01-18 14:40 ` Sagi Grimberg 2017-01-18 15:35 ` Andrey Kuzmin 2017-01-18 14:58 ` Johannes Thumshirn 2017-01-18 15:14 ` Sagi Grimberg 2017-01-18 15:16 ` Johannes Thumshirn 2017-01-18 15:39 ` Hannes Reinecke 2017-01-19 8:12 ` Sagi Grimberg 2017-01-19 8:23 ` Sagi Grimberg 2017-01-19 9:18 ` Johannes Thumshirn 2017-01-19 9:13 ` Johannes Thumshirn 2017-01-17 16:44 ` Andrey Kuzmin 2017-01-17 16:50 ` Sagi Grimberg 2017-01-18 14:02 ` Hannes Reinecke 2017-01-20 0:13 ` Jens Axboe 2017-01-13 15:56 ` Johannes Thumshirn 2017-01-11 15:16 ` Hannes Reinecke 2017-01-12 4:36 ` Stephen Bates 2017-01-12 4:44 ` Jens Axboe 2017-01-12 4:56 ` Stephen Bates 2017-01-19 10:57 ` Ming Lei 2017-01-19 11:03 ` Hannes Reinecke 2017-01-11 16:08 ` Bart Van Assche 2017-01-11 16:12 ` hch 2017-01-11 16:15 ` Jens Axboe 2017-01-11 16:22 ` Hannes Reinecke 2017-01-11 16:26 ` Bart Van Assche 2017-01-11 16:45 ` Hannes Reinecke 2017-01-12 8:52 ` sagi grimberg 2017-01-11 16:14 ` Johannes Thumshirn 2017-01-12 8:41 ` Sagi Grimberg 2017-01-12 19:13 ` Bart Van Assche