* problem with discard granularity in sd
@ 2017-03-31 16:52 David Buckley
2017-04-05 0:12 ` Martin K. Petersen
0 siblings, 1 reply; 9+ messages in thread
From: David Buckley @ 2017-03-31 16:52 UTC (permalink / raw)
To: linux-scsi
Hello,
Hopefully this is the right place for this, and apologies for the
lengthy mail. I'm struggling with an issue with SCSI UNMAP/discard in
newer kernels, and I'm hoping to find a resolution or at least to
better understand why this has changed.
Some background info:
Our Linux boxes are primarily VMs running on VMware backed by NetApp
storage. We have a fair number of systems that directly mount LUNs
(due to I/O requirements, snapshot scheduling, dedupe issues, etc.).
On newer LUNs, the 'space_alloc' option is enabled, which causes the
LUN to report unmap support and free unused blocks on the underlying
storage.
The problem:
I noticed that multiple LUNs with space_alloc enabled reported 100%
utilization on the NetApp but much less on the Linux side. I verified
they were mounted with the discard option and also ran fstrim, which
reported success but did not change the utilization reported by the
NetApp. I was eventually able to isolate the kernel version as the only
factor in whether discard worked. Further testing showed that 3.10.x
handled discard correctly, but 4.4.x would never free blocks. This
was verified on a single machine with the only change being the
kernel.
The only notable difference I could find was in the
/sys/block/sdX/discard* values: on 3.10.x the discard granularity was
reported as 4096, while on 4.4.x it was 512 (the logical block size is
512 and the physical block size is 4096 on these LUNs). Eventually that
led me to these patches for sd.c:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/sd.c?id=397737223c59e89dca7305feb6528caef8fbef84
and https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/scsi/sd.c?id=f4327a95dd080ed6aecb185478a88ce1ee4fa3c4.
They result in discard granularity being forced to the logical block
size if the disk reports LBPRZ is enabled (which the NetApp LUNs do).
It seems this change is responsible for the difference in discard
granularity, and my assumption is that because WAFL is actually a
4K-block filesystem, the NetApp requires 4K granularity and ignores
the 512-byte discard requests.
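For reference, the LBPRZ special case those commits add to
sd_config_discard() looks roughly like this (paraphrased from my
reading of the 4.4 source, so the exact form may differ slightly):

	if (sdkp->lbprz) {
		/* LBPRZ set: granularity pinned to the logical block size */
		q->limits.discard_alignment = 0;
		q->limits.discard_granularity = logical_block_size;
	} else {
		/* otherwise the device-reported unmap values are honored */
		q->limits.discard_alignment =
			sdkp->unmap_alignment * logical_block_size;
		q->limits.discard_granularity =
			max(sdkp->physical_block_size,
			    sdkp->unmap_granularity * logical_block_size);
	}

These LUNs report lbprz=1, so the first branch is taken and
discard_granularity ends up as 512 instead of 4096.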
It's not clear to me whether this is a bug in sd or an issue with the
way the LUNs are presented from the NetApp side (I've opened a case
with them as well and am waiting to hear back). However,
minimum_io_size is 4096, so it seems a bit odd that
discard_granularity would be smaller. And earlier kernel versions work
as expected, which seems to indicate that the problem is in sd.
As far as fixes or workarounds go, there seem to be three potential options:
1) The NetApp could change the reported logical block size to match
the physical block size
2) The NetApp could report LBPRZ=0
3) The sd code could be updated to use max(logical_block_size,
physical_block_size) or max(logical_block_size, minimum_io_size), or
otherwise changed to ensure discard_granularity is set to a supported
value
I'm not sure of the implications of either of the NetApp changes,
though reporting 4K logical blocks seems plausible as this is supported
in newer OSes at least. The sd change would potentially undo, at least
in part, the patches referenced above. But it would seem that (assuming
an aligned filesystem with 4K blocks and minimum_io_size=4096) there is
no possibility of a partial-block discard, nor any advantage to sending
the discard requests in 512-byte blocks?
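To illustrate option 3, the kind of change I have in mind would be
something along these lines (purely a sketch, not a tested patch):

	/* sketch only: never report a discard granularity smaller
	 * than the physical block size, even when LBPRZ is set */
	q->limits.discard_granularity =
		max(logical_block_size, sdkp->physical_block_size);

(minimum_io_size could be used in place of the physical block size,
per the alternative above.)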
Any help is greatly appreciated.
Thanks,
-David
* Re: problem with discard granularity in sd
2017-03-31 16:52 problem with discard granularity in sd David Buckley
@ 2017-04-05 0:12 ` Martin K. Petersen
2017-04-05 16:14 ` David Buckley
0 siblings, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2017-04-05 0:12 UTC (permalink / raw)
To: David Buckley; +Cc: linux-scsi
David Buckley <dbuckley@oreilly.com> writes:
David,
> They result in discard granularity being forced to the logical block
> size if the disk reports LBPRZ is enabled (which the NetApp LUNs do).
Block zeroing and unmapping currently share some plumbing, and that
has led to some compromises. In this case, the bias is towards ensuring
data integrity for zeroing at the expense of not aligning unmap requests.
Christoph has worked on separating those two functions. His code is
currently under review.
> I'm not sure of the implications of either of the NetApp changes,
> though reporting 4K logical blocks seems plausible as this is
> supported in newer OSes at least.
Yes, but it may break legacy applications that assume a 512-byte logical
block size.
> The sd change would potentially undo, at least in part, the patches
> referenced above. But it would seem that (assuming an aligned
> filesystem with 4K blocks and minimum_io_size=4096) there is no
> possibility of a partial-block discard, nor any advantage to sending
> the discard requests in 512-byte blocks?
The unmap granularity inside a device is often much, much bigger than
4K. So aligning to that probably won't make a difference. And it's
imperative to filesystems that zeroing works at logical block size
granularity.
The expected behavior for a device is that it unmaps whichever full
unmap granularity chunks are described by a received request. And then
explicitly zeroes any partial chunks at the head and tail. So I am
surprised you see no reclamation whatsoever.
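To make that concrete, for a device with an 8-block unmap granularity
the split would look something like this (made-up variable names,
purely to illustrate the head/tail handling, not kernel code):

	sector_t start = first_lba, end = first_lba + nr_lbas;
	sector_t head  = round_up(start, 8);   /* start of first full chunk  */
	sector_t tail  = round_down(end, 8);   /* end of the last full chunk */
	/* the device deallocates [head, tail) and explicitly zeroes
	 * [start, head) and [tail, end) */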
With the impending zero/unmap separation things might fare better. But
I'd still like to understand the behavior you observe. Please provide
the output of:
sg_vpd -p lbpv /dev/sdN
sg_vpd -p bl /dev/sdN
for one of the LUNs and I'll take a look.
Thanks!
--
Martin K. Petersen Oracle Linux Engineering
* Re: problem with discard granularity in sd
2017-04-05 0:12 ` Martin K. Petersen
@ 2017-04-05 16:14 ` David Buckley
2017-04-06 17:34 ` Martin K. Petersen
0 siblings, 1 reply; 9+ messages in thread
From: David Buckley @ 2017-04-05 16:14 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-scsi
Hi Martin,
I really appreciate the response. Here is the VPD page data you asked for:
Logical block provisioning VPD page (SBC):
  Unmap command supported (LBPU): 1
  Write same (16) with unmap bit supported (LBWS): 1
  Write same (10) with unmap bit supported (LBWS10): 1
  Logical block provisioning read zeros (LBPRZ): 1
  Anchored LBAs supported (ANC_SUP): 0
  Threshold exponent: 0
  Descriptor present (DP): 0
  Provisioning type: 2
Block limits VPD page (SBC):
  Write same no zero (WSNZ): 1
  Maximum compare and write length: 1 blocks
  Optimal transfer length granularity: 8 blocks
  Maximum transfer length: 8388607 blocks
  Optimal transfer length: 128 blocks
  Maximum prefetch length: 0 blocks
  Maximum unmap LBA count: 8192
  Maximum unmap block descriptor count: 64
  Optimal unmap granularity: 8
  Unmap granularity alignment valid: 1
  Unmap granularity alignment: 0
  Maximum write same length: 0x4000 blocks
As I mentioned previously, I'm fairly certain that the issue I'm
seeing is due to the fact that while NetApp LUNs are presented as 512B
logical/4K physical disks for compatibility, they actually don't
support requests smaller than 4K (which makes sense as NetApp LUNs are
actually just files allocated on the 4K-block WAFL filesystem).
If the optimal granularity values from VPD are respected (as is the
case with 3.10 kernels), the result is:

  minimum_io_size     = 512 * 8 = 4096  (logical_block_size * VPD optimal transfer length granularity)
  discard_granularity = 512 * 8 = 4096  (logical_block_size * VPD optimal unmap granularity)
With recent kernels the value of minimum_io_size is unchanged, but
discard_granularity is explicitly set to logical_block_size, which
results in unmap requests being either dropped or ignored.
I understand that the VPD values from LUNs may be atypical compared to
physical disks, and I expect there are some major differences in how
unmap (and zeroing) is handled between physical disks and LUNs. But
it is unfortunate that a solution for misaligned I/O would break
fully-aligned requests which worked flawlessly in previous kernels.
Let me know if there's any additional information I can provide. This
has resulted in a 2-3x increase in raw disk requirements for some
workloads (unfortunately on SSD too), and I'd love to find a solution
that doesn't require rolling back to a 3.10 kernel.
Thanks again!
-David
* Re: problem with discard granularity in sd
2017-04-05 16:14 ` David Buckley
@ 2017-04-06 17:34 ` Martin K. Petersen
0 siblings, 0 replies; 9+ messages in thread
From: Martin K. Petersen @ 2017-04-06 17:34 UTC (permalink / raw)
To: David Buckley; +Cc: Martin K. Petersen, linux-scsi
David Buckley <dbuckley@oreilly.com> writes:
David,
> As I mentioned previously, I'm fairly certain that the issue I'm
> seeing is due to the fact that while NetApp LUNs are presented as 512B
> logical/4K physical disks for compatibility, they actually don't
> support requests smaller than 4K (which makes sense as NetApp LUNs are
> actually just files allocated on the 4K-block WAFL filesystem).
That may be. But they should still deallocate all the whole 4K blocks
described by an UNMAP request. Even if head and tail are not aligned.
> Let me know if there's any additional information I can provide. This
> has resulted in a 2-3x increase in raw disk requirements for some
> workloads (unfortunately on SSD too), and I'd love to find a solution
> that doesn't require rolling back to a 3.10 kernel.
I just posted some patches yesterday that will address this (using WRITE
SAME w/ UNMAP for block zeroing and allowing discards to be sent using
the UNMAP command, honoring the granularity and alignment suggested by
the device). That's 4.13 material, though.
The enterprise distros have many customers using NetApp filers. I'm a
bit puzzled that this is the first we're hearing of this...
--
Martin K. Petersen Oracle Linux Engineering
* Re: problem with discard granularity in sd
@ 2017-04-11 18:07 David Buckley
2017-04-12 1:55 ` Martin K. Petersen
0 siblings, 1 reply; 9+ messages in thread
From: David Buckley @ 2017-04-11 18:07 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-scsi
(re-sending as I somehow lost the subject on my previous reply)
Martin,
I'm rather surprised as well that nobody else has reported this
previously, especially as NetApp hadn't received any reports. The most
plausible explanation I can think of is that EL 7 is still based on a
3.10 kernel, so it is too old to be affected, and that is likely what
most NetApp customers are running.
I've been testing some additional kernels to narrow down the affected
versions, and the change in discard_granularity does not correspond
with discard failing (as you suggested).
With kernel 3.13, discard works as expected with
discard_granularity=4096 and discard_max_bytes=4194304.
With kernel 3.19, discard stops working completely, with
discard_granularity=4096 and discard_max_bytes=8388608.
Do you think the change in discard_max_bytes could be relevant? If
that comes from the VPD data, it seems odd that it would change.
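For what it's worth, working through the numbers (assuming
discard_max_bytes is derived from the block counts in the VPD page, in
512-byte logical blocks):

	4194304 = 8192 * 512    (matches "Maximum unmap LBA count: 8192")
	8388608 = 16384 * 512   (matches "Maximum write same length: 0x4000")

So the 3.13 value lines up with the unmap limit, and the 3.19 value
with the write same limit.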
I am wondering if part of the issue is that in my use case, UNMAP and
WRITE SAME zeroing produce very different results. With thin
provisioned LUNs, UNMAP requests result in the blocks being freed and
thus reduce the actual size of the LUN allocation on disk. If WRITE SAME
requests are used to zero the blocks, they remain allocated and thus the
real size of the LUN grows to match the allocated size (effectively
thick-provisioning the LUN). This matches what I am
seeing on kernels where discard is not working; running 'fstrim
/lun/mount' results in the LUN growing to its maximum size.
Thank you again for your help with this!
-David
* Re: problem with discard granularity in sd
2017-04-11 18:07 David Buckley
@ 2017-04-12 1:55 ` Martin K. Petersen
2017-04-12 23:58 ` David Buckley
0 siblings, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2017-04-12 1:55 UTC (permalink / raw)
To: David Buckley; +Cc: Martin K. Petersen, linux-scsi
David,
> I am wondering if part of the issue is that in my use case, UNMAP and
> WRITE SAME zeroing produce very different results. With thin
> provisioned LUNs, UNMAP requests result in the blocks being freed and
> thus reduce the actual size of the LUN allocation on disk. If WRITE
> SAME requests are used to zero the blocks, they remain allocated and
> thus the real size of the LUN grows to match the allocated size
> (effectively thick-provisioning the LUN).
The filer explicitly reported support for WRITE SAME(10/16) with UNMAP.
It seems very odd that it would then completely ignore the UNMAP bit and
do a regular WRITE SAME.
Are you running the latest firmware, btw?
In any case, the changes I mentioned are now queued up for 4.12. But
it'll obviously take a while for those to trickle into the
distributions...
--
Martin K. Petersen Oracle Linux Engineering
* Re: problem with discard granularity in sd
2017-04-12 1:55 ` Martin K. Petersen
@ 2017-04-12 23:58 ` David Buckley
2017-04-14 2:44 ` Martin K. Petersen
0 siblings, 1 reply; 9+ messages in thread
From: David Buckley @ 2017-04-12 23:58 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-scsi
Hi Martin,
I have identified the key difference between the older working kernels
and the newer ones: on older kernels the provisioning_mode is set to
'unmap', while on newer kernels it is 'writesame_16'. This is due to
a change in sd_read_block_limits which prevents selection of
SD_LBP_UNMAP if the disk reports lbprz=1, which I'm assuming is the
desired behavior.
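Roughly, the selection logic in sd_read_block_limits() as I read it
looks like this (paraphrased, so it may not be an exact quote of the
source):

	if (sdkp->lbpu && sdkp->max_unmap_blocks && !sdkp->lbprz)
		sd_config_discard(sdkp, SD_LBP_UNMAP);
	else if (sdkp->lbpws)
		sd_config_discard(sdkp, SD_LBP_WS16);
	else if (sdkp->lbpws10)
		sd_config_discard(sdkp, SD_LBP_WS10);
	else
		sd_config_discard(sdkp, SD_LBP_DISABLE);

Since these LUNs report lbprz=1 (and lbpws=1), the WS16 branch wins,
which matches the provisioning_mode I see.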
As I mentioned previously, one difference between working and failing
kernels is that discard_max_bytes is double the size on the latter.
To determine whether the NetApp incorrectly handles WRITE SAME with
UNMAP in general or if the increase in discard_max_bytes was
responsible, I swapped max_ws_blocks with max_unmap_blocks for WS16
requests in sd_config_discard in my test kernel. Testing with this
change showed UNMAP requests working as expected, indicating that the
discard_max_bytes change is directly responsible for the failing
requests.
The underlying issue seems to be that (per VPD page) the maximum
supported unmap request is 8192 * 512 = 4194304 bytes, while the
maximum write same request is 0x4000 * 512 = 8388608 bytes. It
appears both of these values are correct, and in the case where a WS
UNMAP request larger than 8192 blocks is received the UNMAP is ignored
and the result is effectively a WS ZERO request.
If I'm correct in my understanding, then it seems like the root cause
of the failures is that the current code assumes a disk will always
support WS UNMAP requests up to max_ws_blocks and does not account for
a case where max_unmap_blocks is smaller than max_ws_blocks.
I would think something like this could be a reasonable fix:
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index fe0f799..b0233c2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -643,6 +643,8 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 	struct request_queue *q = sdkp->disk->queue;
 	unsigned int logical_block_size = sdkp->device->sector_size;
 	unsigned int max_blocks = 0;
+	unsigned int max_ws_unmap_blocks =
+		min_not_zero(sdkp->max_ws_blocks, sdkp->max_unmap_blocks);
 
 	q->limits.discard_zeroes_data = 0;
 
@@ -679,7 +681,7 @@ static void sd_config_discard(struct scsi_disk *sdkp, unsigned int mode)
 		break;
 
 	case SD_LBP_WS16:
-		max_blocks = min_not_zero(sdkp->max_ws_blocks,
+		max_blocks = min_not_zero(max_ws_unmap_blocks,
 					  (u32)SD_MAX_WS16_BLOCKS);
 		q->limits.discard_zeroes_data = sdkp->lbprz;
 		break;
Thanks yet again for all your help.
-David
* Re: problem with discard granularity in sd
2017-04-12 23:58 ` David Buckley
@ 2017-04-14 2:44 ` Martin K. Petersen
2017-04-14 20:07 ` David Buckley
0 siblings, 1 reply; 9+ messages in thread
From: Martin K. Petersen @ 2017-04-14 2:44 UTC (permalink / raw)
To: David Buckley; +Cc: Martin K. Petersen, linux-scsi
David Buckley <dbuckley@oreilly.com> writes:
David,
> The underlying issue seems to be that (per VPD page) the maximum
> supported unmap request is 8192 * 512 = 4194304 bytes, while the
> maximum write same request is 0x4000 * 512 = 8388608 bytes. It
> appears both of these values are correct, and in the case where a WS
> UNMAP request larger than 8192 blocks is received the UNMAP is ignored
> and the result is effectively a WS ZERO request.
That's broken on the filer, then.
> If I'm correct in my understanding, then it seems like the root cause
> of the failures is that the current code assumes a disk will always
> support WS UNMAP requests up to max_ws_blocks and does not account for
> a case where max_unmap_blocks is smaller than max_ws_blocks.
SBC says:
"A MAXIMUM UNMAP LBA COUNT field set to a non-zero value indicates
the maximum number of LBAs that may be unmapped by an UNMAP
command."
"A MAXIMUM WRITE SAME LENGTH field set to a non-zero value indicates
the maximum number of contiguous logical blocks that the device
server allows to be unmapped or written in a single WRITE SAME
command."
I'm confident that our behavior is correct and that the problem is on
the storage side.
The good news is that the new discard code would alleviate your problem
in that (based on what the filer reported in your setup) we'd use UNMAP
to deallocate and WRITE SAME to zero.
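In rough terms (a conceptual sketch of the new behavior, not the
literal patch text), the mapping becomes:

	/*
	 * REQ_OP_DISCARD      -> UNMAP, bounded by the reported unmap
	 *                        granularity, alignment and maximum
	 *                        unmap LBA count
	 * REQ_OP_WRITE_ZEROES -> WRITE SAME(16), with the UNMAP bit set
	 *                        only when the device guarantees zeroed,
	 *                        deallocated blocks
	 */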
--
Martin K. Petersen Oracle Linux Engineering
* Re: problem with discard granularity in sd
2017-04-14 2:44 ` Martin K. Petersen
@ 2017-04-14 20:07 ` David Buckley
0 siblings, 0 replies; 9+ messages in thread
From: David Buckley @ 2017-04-14 20:07 UTC (permalink / raw)
To: Martin K. Petersen; +Cc: linux-scsi
Hi Martin,
I had assumed that the VPD values for the LUNs were valid, and hadn't
even thought to check if they were standards-compliant. Based on the
SBC descriptions, it is clear that you are correct and this is an
issue with the storage not reporting a proper max_ws_blocks. At least
now I have enough details to demonstrate to NetApp that this is an
issue on their side, and also have the relevant details from SBC.
Thank you once again for your help in understanding the underlying
technical issues responsible for the WS UNMAP failures in my
environment. I really appreciate you taking the time to respond,
especially as this didn't turn out to be an issue with kernel code at
all.
-David