linux-scsi.vger.kernel.org archive mirror
* problem with discard granularity in sd
@ 2017-03-31 16:52 David Buckley
  2017-04-05  0:12 ` Martin K. Petersen
  0 siblings, 1 reply; 9+ messages in thread
From: David Buckley @ 2017-03-31 16:52 UTC (permalink / raw)
  To: linux-scsi

Hello,

Hopefully this is the right place for this, and apologies for the
lengthy mail.  I'm struggling with an issue with SCSI UNMAP/discard in
newer kernels, and I'm hoping to find a resolution or at least to
better understand why this has changed.

Some background info:
Our Linux boxes are primarily VMs running on VMware backed by NetApp
storage.  We have a fair number of systems that directly mount LUNs
(due to I/O requirements, snapshot scheduling, dedupe issues, etc.).
On newer LUNs, the 'space_alloc' option is enabled, which causes the
LUN to report unmap support and free unused blocks on the underlying
storage.

The problem:
I noticed that multiple LUNs with space_alloc enabled reported 100%
utilization on the NetApp but much lower utilization on the Linux
side.  I verified they were mounted with the discard option and also
ran fstrim, which reported success but did not change the utilization
reported by the NetApp.  I was eventually able to isolate the kernel
version as the only factor in whether discard worked.  Further testing
showed that 3.10.x handled discard correctly, but 4.4.x would never
free blocks.  This was verified on a single machine with the only
change being the kernel.

The only notable difference I could find was in the
/sys/block/sdX/discard* values: on 3.10.x the discard granularity was
reported as 4096, while on 4.4.x it was 512 (logical block size is
512, physical is 4096 on these LUNs).  Eventually that led me to these
patches for sd.c:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/sd.c?id=397737223c59e89dca7305feb6528caef8fbef84
and https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/sd.c?id=f4327a95dd080ed6aecb185478a88ce1ee4fa3c4.
They force the discard granularity to the logical block size if the
disk reports LBPRZ as enabled (which the NetApp LUNs do).  It seems
that this change is responsible for the difference in discard
granularity, and my assumption is that because WAFL is actually a 4K
block filesystem, the NetApp requires 4K granularity and ignores the
512-byte discard requests.
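
For reference, this is roughly how I've been comparing the relevant
queue limits between kernels.  It's only a small sketch; the device
name is an example, and the same values can of course be read with cat:

/* Sketch: dump the discard-related queue limits for a block device.
 * The device name below is an example only. */
#include <stdio.h>
#include <string.h>

static void show(const char *dev, const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f)) {
		printf("%-22s <unreadable>\n", attr);
		if (f)
			fclose(f);
		return;
	}
	buf[strcspn(buf, "\n")] = '\0';
	printf("%-22s %s\n", attr, buf);
	fclose(f);
}

int main(void)
{
	const char *dev = "sdb";	/* example device */
	const char *attrs[] = {
		"logical_block_size", "physical_block_size",
		"minimum_io_size", "discard_granularity",
		"discard_max_bytes",
	};

	for (unsigned int i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
		show(dev, attrs[i]);
	return 0;
}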

It's not clear to me whether this is a bug in sd or an issue in the
way the LUNs are presented from the NetApp side (I've opened a case
with them as well and am waiting to hear back).  However,
minimum_io_size is 4096, so it seems a bit odd that
discard_granularity would be smaller.  And earlier kernel versions
work as expected, which seems to indicate the problem is in sd.

As far as fixes or workarounds go, there seem to be three potential options:

1) The NetApp could change the reported logical block size to match
the physical block size
2) The NetApp could report LBPRZ=0
3) The sd code could be updated to use max(logical_block_size,
physical_block_size) or max(logical_block_size, minimum_io_size), or
otherwise changed to ensure discard_granularity is set to a supported
value (a rough sketch of this idea follows below)
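
To be clear about what I mean in option 3, here is a purely
illustrative sketch of the granularity selection.  The function and
parameter names are invented for the example; this is not a patch
against the actual sd.c code:

/* Illustrative only: the idea is that the reported discard granularity
 * should never drop below the physical block size, even when LBPRZ=1.
 * Names are invented for this sketch; not actual sd.c code. */
static inline unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

static unsigned int proposed_discard_granularity(unsigned int logical_block_size,
						 unsigned int physical_block_size)
{
	/* or max_u(logical_block_size, minimum_io_size), per the second
	 * variant above; on these LUNs either would give 4096 */
	return max_u(logical_block_size, physical_block_size);
}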

I'm not sure of the implications of either of the NetApp-side changes,
though reporting 4K logical blocks seems feasible, as newer operating
systems at least support it.  The sd change would at least partially
undo the patches referenced above.  But it would seem that (assuming
an aligned filesystem with 4K blocks and minimum_io_size=4096) there
is no possibility of a partial-block discard, nor any advantage to
sending the discard requests in 512-byte blocks?

Any help is greatly appreciated.

Thanks,
-David

* Re: problem with discard granularity in sd
@ 2017-04-11 18:07 David Buckley
  2017-04-12  1:55 ` Martin K. Petersen
  0 siblings, 1 reply; 9+ messages in thread
From: David Buckley @ 2017-04-11 18:07 UTC (permalink / raw)
  To: Martin K. Petersen; +Cc: linux-scsi

(re-sending as I somehow lost the subject on my previous reply)

Martin,

I'm rather surprised that nobody else has reported this previously,
especially as NetApp hadn't received any reports either.  The only
plausible explanation I can think of is that EL 7 is still based on a
3.10 kernel and so is too old to be affected, and that is likely what
most NetApp customers are running.



I've been testing some additional kernels to narrow down the affected
versions, and the change in discard_granularity does not correspond
with discard failing (as you suggested).

With kernel 3.13, discard works as expected with
discard_granularity=4096 and discard_max_bytes=4194304.

With kernel 3.19, discard stops working completely, with
discard_granularity=4096 and discard_max_bytes=8388608.

Do you think the change in discard_max_bytes could be relevant?  If
that comes from the VPD data it seems odd it would change.

I am wondering if part of the issue is that, in my use case, UNMAP and
WRITE SAME zeroing produce very different results.  With
thin-provisioned LUNs, UNMAP requests result in the blocks being freed
and thus reduce the actual size of the LUN allocation on disk.  If
WRITE SAME requests are used to zero the blocks, they remain allocated
and the real size of the LUN grows to match the allocated size
(effectively thick-provisioning the LUN).  This matches what I am
seeing on kernels where discard is not working; running 'fstrim
/lun/mount' results in the LUN growing to its maximum size.
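
In case it helps, this is the sort of minimal check I've been using to
see whether the filer actually frees space for a given request: issue a
single aligned discard and then compare the LUN allocation on the
NetApp.  It's only a sketch; the device path and offset are examples,
and it obviously destroys data in the discarded range:

/* Sketch: discard one aligned 4K range, then check the allocation on
 * the filer afterwards.  Device path and offset are examples only;
 * this destroys data in the discarded range. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* BLKDISCARD */

int main(void)
{
	uint64_t range[2] = { 1048576, 4096 };	/* offset, length in bytes */
	int fd = open("/dev/sdb", O_WRONLY);	/* example device */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKDISCARD, &range)) {
		perror("BLKDISCARD");
		close(fd);
		return 1;
	}
	printf("discarded %llu bytes at offset %llu\n",
	       (unsigned long long)range[1], (unsigned long long)range[0]);
	close(fd);
	return 0;
}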

Thank you again for your help with this!

-David

On Thu, Apr 6, 2017 at 10:34 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
> David Buckley <dbuckley@oreilly.com> writes:
>
> David,
>
>> As I mentioned previously, I'm fairly certain that the issue I'm
>> seeing is due to the fact that while NetApp LUNs are presented as 512B
>> logical/4K physical disks for compatibility, they actually don't
>> support requests smaller than 4K (which makes sense as NetApp LUNs are
>> actually just files allocated on the 4K-block WAFL filesystem).
>
> That may be. But they should still deallocate all the whole 4K blocks
> described by an UNMAP request. Even if head and tail are not aligned.
>
>> Let me know if there's any additional information I can provide. This
>> has resulted in a 2-3x increase in raw disk requirements for some
>> workloads (unfortunately on SSD too), and I'd love to find a solution
>> that doesn't require rolling back to a 3.10 kernel.
>
> I just posted some patches yesterday that will address this (using WRITE
> SAME w/ UNMAP for block zeroing and allowing discards to be sent using
> the UNMAP command, honoring the granularity and alignment suggested by
> the device). That's 4.13 material, though.
>
> The enterprise distros have many customers using NetApp filers. I'm a
> bit puzzled why this is the first we hear of this...
>
> --
> Martin K. Petersen      Oracle Linux Engineering


Thread overview: 9+ messages
2017-03-31 16:52 problem with discard granularity in sd David Buckley
2017-04-05  0:12 ` Martin K. Petersen
2017-04-05 16:14   ` David Buckley
2017-04-06 17:34     ` Martin K. Petersen
2017-04-11 18:07 David Buckley
2017-04-12  1:55 ` Martin K. Petersen
2017-04-12 23:58   ` David Buckley
2017-04-14  2:44     ` Martin K. Petersen
2017-04-14 20:07       ` David Buckley
