From mboxrd@z Thu Jan 1 00:00:00 1970 From: huhaiyan@huawei.com (Haiyan Hu) Date: Thu, 5 Sep 2013 16:55:43 +0800 Subject: [PATCH] NVMe: Avoid caculate cq head doorbel in nvme_process_cq() In-Reply-To: <20130904165142.GA20931@linux.intel.com> References: <20130903155316.GY4707@linux.intel.com> <5226FA47.10205@huawei.com> <20130904165142.GA20931@linux.intel.com> Message-ID: <5228470F.2040206@huawei.com> On 2013/9/5 0:51, Matthew Wilcox wrote: > On Wed, Sep 04, 2013@05:15:51PM +0800, Haiyan Hu wrote: >> On 2013/9/3 23:53, Matthew Wilcox wrote: >>> Have you been able to measure a difference? If so, can you share >>> relative numbers? >> >> Sorry for my inaccurate description. >> Our test data shows that this patch does not change cmd process latency. >> But I think it can improve code efficiency, maybe a little. > > Right, so it's a trade-off. You want to add 8 bytes to the queue data > structure in order to save a few instructions from being executed at > queue processing time. We need to quantify the savings (since we know > the costs). I approximated the code by replacing: > > - writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride)); > + writel(head, nvmeq->q_db); > > Here's the code before: > > 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax > 2ce: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx > 2d5: 8b 48 40 mov 0x40(%rax),%ecx > 2d8: b8 01 00 00 00 mov $0x1,%eax > 2dd: d3 e0 shl %cl,%eax > 2df: 48 98 cltq > 2e1: 48 8d 14 82 lea (%rdx,%rax,4),%rdx > 2e5: 41 0f b7 c4 movzwl %r12w,%eax > 2e9: 89 02 mov %eax,(%rdx) > > Here's the code after: > > 2ca: 48 8b 93 88 00 00 00 mov 0x88(%rbx),%rdx > 2d1: 41 0f b7 c4 movzwl %r12w,%eax > 2d5: 89 02 mov %eax,(%rdx) > > That's 6 instructions sved, and to be fair they have some pretty tight > dependencies. Still, on a 3GHz processor, that's maybe 2 nanoseconds. > If we're doing a million IOPS, we could save 2ms/s, or 0.2%. Pretty tough > to measure, let alone justify. > > Here's an alternative that doesn't require adding 8 bytes to the > nvme_queue. Instead of storing the shift in db_stride, we can store the > actual stride. Then there is less calculation to be done at > completion time, and the code looks like this: > > - writel(head, nvmeq->q_db + (1 << nvmeq->dev->db_stride)); > + writel(head, nvmeq->q_db + nvmeq->dev->db_stride); > > 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax > 2ce: 48 63 50 40 movslq 0x40(%rax),%rdx > 2d2: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax > 2d9: 48 8d 14 90 lea (%rax,%rdx,4),%rdx > 2dd: 41 0f b7 c4 movzwl %r12w,%eax > 2e1: 89 02 mov %eax,(%rdx) > > That saves half the instructions, for no increased data structure usage. > > One other microoptimisation we can do is change the type of db_stride from > int to unsigned. Then the compiler produces: > > 2ca: 48 8b 43 08 mov 0x8(%rbx),%rax > 2ce: 8b 50 40 mov 0x40(%rax),%edx > 2d1: 48 8b 83 88 00 00 00 mov 0x88(%rbx),%rax > 2d8: 48 8d 14 90 lea (%rax,%rdx,4),%rdx > 2dc: 41 0f b7 c4 movzwl %r12w,%eax > 2e0: 89 02 mov %eax,(%rdx) > > which uses one fewer byte (at address 2ce, it uses mov instead of movslq). > > > Would you like to send a patch which changes the type of db_stride to > unsigned and changes the value stored there to be 1 << the current value? > > Or you can try to persuade me again that the tradeoff is worth adding > the extra 8 bytes to the data structure :-) > . > Your analysis is more comprehensive. Of course, I'd like to send a new patch that you advised. Do you think the updated variable named "db_stride_shift"? Thank you.