From mboxrd@z Thu Jan 1 00:00:00 1970 From: willy@linux.intel.com (Matthew Wilcox) Date: Wed, 4 Sep 2013 12:51:42 -0400 Subject: [PATCH] NVMe: Avoid caculate cq head doorbel in nvme_process_cq() In-Reply-To: <5226FA47.10205@huawei.com> References: <20130903155316.GY4707@linux.intel.com> <5226FA47.10205@huawei.com> Message-ID: <20130904165142.GA20931@linux.intel.com> On Wed, Sep 04, 2013@05:15:51PM +0800, Haiyan Hu wrote: > On 2013/9/3 23:53, Matthew Wilcox wrote: > > Have you been able to measure a difference? If so, can you share > > relative numbers? > > Sorry for my inaccurate description. > Our test data shows that this patch does not change cmd process latency. > But I think it can improve code efficiency, maybe a little. Right, so it's a trade-off. You want to add 8 bytes to the queue data structure in order to save a few instructions from being executed at queue processing time. We need to quantify the savings (since we know the costs). I approximated the code by replacing: - writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride)); + writel(head, nvmeq->q_db); Here's the code before: 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax 2ce: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx 2d5: 8b 48 40 mov 0x40(%rax),%ecx 2d8: b8 01 00 00 00 mov $0x1,%eax 2dd: d3 e0 shl %cl,%eax 2df: 48 98 cltq 2e1: 48 8d 14 82 lea (%rdx,%rax,4),%rdx 2e5: 41 0f b7 c4 movzwl %r12w,%eax 2e9: 89 02 mov %eax,(%rdx) Here's the code after: 2ca: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx 2d1: 41 0f b7 c4 movzwl %r12w,%eax 2d5: 89 02 mov %eax,(%rdx) That's 6 instructions sved, and to be fair they have some pretty tight dependencies. Still, on a 3GHz processor, that's maybe 2 nanoseconds. If we're doing a million IOPS, we could save 2ms/s, or 0.2%. Pretty tough to measure, let alone justify. Here's an alternative that doesn't require adding 8 bytes to the nvme_queue. Instead of storing the shift in db_stride, we can store the actual stride. Then there is less calculation to be done at completion time, and the code looks like this: - writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride)); + writel(head, nvmeq->q_db + nvmeq->dev->db_stride); 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax 2ce: 48 63 50 40 movslq 0x40(%rax),%rdx 2d2: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax 2d9: 48 8d 14 90 lea (%rax,%rdx,4),%rdx 2dd: 41 0f b7 c4 movzwl %r12w,%eax 2e1: 89 02 mov %eax,(%rdx) That saves half the instructions, for no increased data structure usage. One other microoptimisation we can do is change the type of db_stride from int to unsigned. Then the compiler produces: 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax 2ce: 8b 50 40 mov 0x40(%rax),%edx 2d1: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax 2d8: 48 8d 14 90 lea (%rax,%rdx,4),%rdx 2dc: 41 0f b7 c4 movzwl %r12w,%eax 2e0: 89 02 mov %eax,(%rdx) which uses one fewer byte (at address 2ce, it uses mov instead of movslq). Would you like to send a patch which changes the type of db_stride to unsigned and changes the value stored there to be 1 << the current value? Or you can try to persuade me again that the tradeoff is worth adding the extra 8 bytes to the data structure :-)