From mboxrd@z Thu Jan 1 00:00:00 1970
From: willy@linux.intel.com (Matthew Wilcox)
Date: Thu, 19 Jun 2014 12:59:57 -0400
Subject: [PATCH] NVMe: Remove superfluous cqe_seen
In-Reply-To: <80B89753B40C5141A3E2D53FE7A2A8A9943A7ADC@NTXBOIMBX02.micron.com>
References: <537CED01.6040106@micron.com> <20140521231706.GP6121@linux.intel.com> <80B89753B40C5141A3E2D53FE7A2A8A9943A7ADC@NTXBOIMBX02.micron.com>
Message-ID: <20140619165957.GK12025@linux.intel.com>

On Thu, May 22, 2014 at 12:10:19AM +0000, Sam Bradshaw (sbradshaw) wrote:
> Performance problem, though not very easily measured. At very high iops
> rates, most if not all cqe's are processed via nvme_process_cq() in
> make_request(), leaving nvme_irq() with no work to do. Nevertheless, it
> always writes cqe_seen, which invalidates a very hot cacheline. This
> is somewhat exacerbated when IO submissions originate on a remote node
> relative to the cpu handling the irq.

I was thinking "Hey, we should move cqe_seen to a different cacheline".
So I looked at the cacheline assignments for the different variables,
and cqe_seen is on the same cacheline as cq_head and cq_phase, so that
cacheline is already being dirtied. Indeed, it's in the same Dword as
cq_phase, so I'd be amazed if the CPU didn't coalesce the two writes.

That might be a more fruitful patch ... rearrange nvme_queue to put
cq_head, cq_phase and cqe_seen in the same Dword, and expect the CPU to
optimise the three assignments into a single Dword store. I'll let you
try it out since you have the setup to benchmark it.

Right now, this is the layout I see:

	/* --- cacheline 3 boundary (192 bytes) --- */
	u32 *                      q_db;                 /*   192     8 */
	u16                        q_depth;              /*   200     2 */
	u16                        cq_vector;            /*   202     2 */
	u16                        sq_head;              /*   204     2 */
	u16                        sq_tail;              /*   206     2 */
	u16                        cq_head;              /*   208     2 */
	u16                        qid;                  /*   210     2 */
	u8                         cq_phase;             /*   212     1 */
	u8                         cqe_seen;             /*   213     1 */
	u8                         q_suspended;          /*   214     1 */

I notice a 4-byte hole after q_lock, so moving cq_head, cq_phase and
cqe_seen into that space would probably be a good idea (since that
cacheline is definitely dirty). I really haven't tried to optimise the
frequently-updated parts of the data structure into the same cacheline,
and it should really help your bizarre setup :-).
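
A rough sketch of the rearrangement suggested above, in case it helps when trying it out. This is not the real nvme_queue: struct nvme_queue_sketch, the helper fields_share_unit() and sketch_hot_fields_share_dword() are names made up for illustration, q_lock is stood in for by a plain 4-byte integer (assuming a 4-byte spinlock_t), and 64-byte cachelines are assumed from the "cacheline 3 boundary (192 bytes)" line. It only checks where the fields land; whether the CPU actually merges the three stores still has to be measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DWORD_BYTES     4
#define CACHELINE_BYTES 64	/* assumption: 64-byte cachelines */

/*
 * Return 1 if two fields, each given as (offset, size) in bytes, both lie
 * entirely inside the same unit-aligned block (a Dword or a cacheline).
 */
static int fields_share_unit(size_t off_a, size_t size_a,
			     size_t off_b, size_t size_b, size_t unit)
{
	size_t first_a = off_a / unit, last_a = (off_a + size_a - 1) / unit;
	size_t first_b = off_b / unit, last_b = (off_b + size_b - 1) / unit;

	return first_a == last_a && first_b == last_b && first_a == first_b;
}

/*
 * Hypothetical stand-in for the part of nvme_queue around q_lock, with the
 * three completion-side fields moved into the 4-byte hole that follows it.
 */
struct nvme_queue_sketch {
	uint32_t q_lock;	/* placeholder, assumes a 4-byte spinlock_t */
	uint16_t cq_head;	/* bytes 4-5 */
	uint8_t  cq_phase;	/* byte 6 */
	uint8_t  cqe_seen;	/* byte 7 */
};

/* Fail the build if the three fields ever drift out of one aligned Dword. */
_Static_assert(offsetof(struct nvme_queue_sketch, cq_head) % DWORD_BYTES == 0,
	       "cq_head must start an aligned Dword");
_Static_assert(offsetof(struct nvme_queue_sketch, cqe_seen) ==
	       offsetof(struct nvme_queue_sketch, cq_head) + 3,
	       "cqe_seen must be the last byte of that Dword");

/* Runtime version of the same check, using the helper above. */
int sketch_hot_fields_share_dword(void)
{
	typedef struct nvme_queue_sketch q;

	return fields_share_unit(offsetof(q, cq_head), sizeof(uint16_t),
				 offsetof(q, cq_phase), sizeof(uint8_t),
				 DWORD_BYTES) &&
	       fields_share_unit(offsetof(q, cq_head), sizeof(uint16_t),
				 offsetof(q, cqe_seen), sizeof(uint8_t),
				 DWORD_BYTES);
}
```

Feeding the offsets from the layout above into the helper: cq_phase (212) and cqe_seen (213) share a Dword, cq_head (208) sits in the Dword before them, and all three fall in the 64-byte line starting at 192. In the sketch all three occupy bytes 4 to 7, directly after the lock.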