Intermittent zeroed pages with AIO+DIO+XFS

public inbox for linux-xfs@vger.kernel.org

* Intermittent zeroed pages with AIO+DIO+XFS
@ 2017-08-03 14:52 Avi Kivity
  2017-08-03 22:09 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Avi Kivity @ 2017-08-03 14:52 UTC
  To: linux-xfs; +Cc: Glauber Costa, Raphael Carvalho

Hello,

I have an application that uses AIO+DIO to write data to a file on XFS. 
The writes use 128k buffers. Very rarely, I see aligned 4k blocks within 
the file that are zeroed. The blocks are not aligned to a 128k boundary, 
just 4k. The buffers are allocated in anonymous memory, which is usually 
using transparent hugepages.  The files are fully allocated, not sparse 
(checked post-mortem).
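
A minimal sketch of how such zeroed blocks could be located after the fact. The 4k block size and 128k write size come from the report above; the function names `find_zeroed_blocks` and `describe` are hypothetical and not part of the application.

```python
BLOCK = 4096          # block size taken from the report
WRITE = 128 * 1024    # application write size taken from the report

def find_zeroed_blocks(path, block=BLOCK):
    """Return byte offsets of block-aligned, fully zeroed blocks in path."""
    zero = bytes(block)
    hits = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            # A short tail block is compared against zeros of the same length.
            if chunk == zero[:len(chunk)]:
                hits.append(offset)
            offset += len(chunk)
    return hits

def describe(offset, write=WRITE, block=BLOCK):
    """Map an offset to (index of the 128k write, 4k block index inside it)."""
    return offset // write, (offset % write) // block
```

Feeding each reported offset to `describe` shows which 128k write the block belongs to and where it sits inside that write, which helps confirm the "aligned to 4k, not to 128k" observation.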

The writes are concurrent and adjacent. To avoid serialization, we 
ftruncate() the file to a larger size, then ftruncate() it back when we 
know its final size.
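
The write pattern described above can be sketched roughly as follows. This is an illustration only: it uses threads and buffered `os.pwrite` as a stand-in for the real AIO+DIO submission, and `write_chunks`, `CHUNK` and the `reserve` parameter are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 128 * 1024  # write size from the report

def write_chunks(path, chunks, reserve):
    """Extend the file once, write chunks concurrently, then trim to final size.

    Assumes every chunk except possibly the last is CHUNK bytes long and that
    reserve is at least the total amount of data.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, reserve)          # grow up front so no write moves EOF

        def write_one(item):
            index, data = item
            return os.pwrite(fd, data, index * CHUNK)   # adjacent offsets

        with ThreadPoolExecutor(max_workers=4) as pool:
            written = sum(pool.map(write_one, enumerate(chunks)))
        os.ftruncate(fd, written)          # shrink back to the real size
        return written
    finally:
        os.close(fd)
```

Because the file is extended before any data is written, none of the individual writes changes the file size, which is the property the two ftruncate() calls are meant to provide.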

Does this trigger anything in anyone's mind?


* Re: Intermittent zeroed pages with AIO+DIO+XFS
  2017-08-03 14:52 Intermittent zeroed pages with AIO+DIO+XFS Avi Kivity
@ 2017-08-03 22:09 ` Dave Chinner
  2017-08-04  2:40   ` Avi Kivity
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2017-08-03 22:09 UTC
  To: Avi Kivity; +Cc: linux-xfs, Glauber Costa, Raphael Carvalho

On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
> Hello,
> 

Hi Avi,

> I have an application that uses AIO+DIO to write data to a file on
> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
> blocks within the file that are zeroed. The blocks are not aligned
> to a 128k boundary, just 4k. The buffers are allocated in anonymous
> memory, which is usually using transparent hugepages.  The files are
> fully allocated, not sparse (checked post-mortem).

Did you check that the extents are written? i.e. there aren't
sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)

If you turn off transparent huge pages, does the problem go
away?
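
A small sketch for checking the current transparent hugepage setting before and after such an experiment. It assumes the usual sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content marks the active mode with brackets (for example `always madvise [never]`); the helper names are hypothetical.

```python
THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"

def parse_thp_mode(text):
    """Return the active mode from text such as 'always madvise [never]'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no bracketed mode in %r" % text)

def current_thp_mode(path=THP_ENABLED):
    """Read the sysfs file and return the mode that is currently active."""
    with open(path) as f:
        return parse_thp_mode(f.read())
```

Writing `never` to the same file as root turns transparent hugepages off until the next boot, so `current_thp_mode()` can confirm that the test actually ran with them disabled.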

What kernel version is this seen on? We've changed the XFS DIO
IO path implementation substantially in recent times....

> The writes are concurrent and adjacent. To avoid serialization, we
> ftruncate() the file to a larger size, then ftruncate() it back when
> we know its final size.

So it's not extending the file on the writes, so it shouldn't be
triggering EOF block zeroing. The only thing I can think of is
either the data contains zeros or there's an occasional unwritten
extent in the file.

> Does this trigger anything in anyone's mind?

Nope - do you have a reproducer you can share?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Intermittent zeroed pages with AIO+DIO+XFS
  2017-08-03 22:09 ` Dave Chinner
@ 2017-08-04  2:40   ` Avi Kivity
  2017-08-04  2:50     ` Glauber Costa
  2017-08-04  3:14     ` Dave Chinner
  0 siblings, 2 replies; 7+ messages in thread
From: Avi Kivity @ 2017-08-04  2:40 UTC
  To: Dave Chinner; +Cc: linux-xfs, Glauber Costa, Raphael Carvalho

On 08/04/2017 01:09 AM, Dave Chinner wrote:
> On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
>> Hello,
>>
> Hi Avi,
>
>> I have an application that uses AIO+DIO to write data to a file on
>> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
>> blocks within the file that are zeroed. The blocks are not aligned
>> to a 128k boundary, just 4k. The buffers are allocated in anonymous
>> memory, which is usually using transparent hugepages.  The files are
>> fully allocated, not sparse (checked post-mortem).
> Did you check that the extents are written? i.e. there aren't
> sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)

Raphael did that, and the result was that the file was NOT sparse.

btw, we also run with the extent size hint set to 32MB.

> If you turn off transparent huge pages, does the problem go
> away?

We did not check yet.

> What kernel version is this seen on? We've changed the XFS DIO
> IO path implementation substantially in recent times....

CentOS 7.2's kernel. Glauber, do you know the precise version string?

>> The writes are concurrent and adjacent. To avoid serialization, we
>> ftruncate() the file to a larger size, then ftruncate() it back when
>> we know its final size.
> So it's not extending the file on the writes, so it shouldn't be
> triggering EOF block zeroing. The only thing I can think of is
> either the data contains zeros or there's an occasional unwritten
> extent in the file.

The data is compressed, so it can't contain zeros originally. Of course 
it's possible the application zeroed that page after preparing the 
buffer and before the write hit the disk, but that's fairly unlikely. 
Zeroing pages is a kernel thing; even if the application allocated 4k of 
memory (not very common, but it does happen), it wouldn't zero it; and 
that buffer of course is held during the write.

We're adding code to check the buffer before and after the write, and 
also read back from disk.
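
A rough sketch of what such a check could look like, not the code actually being added to the application. `checked_pwrite` is a hypothetical helper, and it uses plain `os.pwrite`/`os.pread`; without O_DIRECT the read-back may be served from the page cache rather than from the disk.

```python
import hashlib
import os

def checked_pwrite(fd, buf, offset):
    """Write buf at offset and report where, if anywhere, the data changed."""
    before = hashlib.sha256(buf).digest()     # checksum before submission
    written = os.pwrite(fd, buf, offset)
    after = hashlib.sha256(buf).digest()      # checksum after completion
    on_disk = os.pread(fd, len(buf), offset)  # read the same range back
    return {
        "complete": written == len(buf),
        "buffer_stable": before == after,      # False: memory changed mid-write
        "readback_ok": on_disk == bytes(buf),  # False: data lost below the app
    }
```

If `buffer_stable` is ever false the problem is in memory (application or kernel page handling); if only `readback_ok` is false the buffer was intact and the loss happened further down the storage stack.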

>
>> Does this trigger anything in anyone's mind?
> Nope - do you have a reproducer you can share?
>

Run a certain NoSQL database for months on a cluster with lots of 
activity, and you _may_ see it a few times. It's very rare, but it's there.


* Re: Intermittent zeroed pages with AIO+DIO+XFS
  2017-08-04  2:40   ` Avi Kivity
@ 2017-08-04  2:50     ` Glauber Costa
  2017-08-04  3:14     ` Dave Chinner
  1 sibling, 0 replies; 7+ messages in thread
From: Glauber Costa @ 2017-08-04  2:50 UTC
  To: Avi Kivity; +Cc: Dave Chinner, linux-xfs, Raphael Carvalho

On Thu, Aug 3, 2017 at 10:40 PM, Avi Kivity <avi@scylladb.com> wrote:
> On 08/04/2017 01:09 AM, Dave Chinner wrote:
>>
>> On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
>>>
>>> Hello,
>>>
>> Hi Avi,
>>
>>> I have an application that uses AIO+DIO to write data to a file on
>>> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
>>> blocks within the file that are zeroed. The blocks are not aligned
>>> to a 128k boundary, just 4k. The buffers are allocated in anonymous
>>> memory, which is usually using transparent hugepages.  The files are
>>> fully allocated, not sparse (checked post-mortem).
>>
>> Did you check that the extents are written? i.e. there aren't
>> sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)
>
>
> Raphael did that, and the result was that the file was NOT sparse.
>
> btw, we also run with the extent size hint set to 32MB.
>
>> If you turn off transparent huge pages, does the problem go
>> away?
>
>
> We did not check yet.
>
>> What kernel version is this seen on? We've changed the XFS DIO
>> IO path implementation substantially in recent times....
>
>
> CentOS 7.2's kernel. Glauber, do you know the precise version string?

Yes I do, sir!

3.10.0-327.el7.x86_64

(Hey, Dave!)

>
>>> The writes are concurrent and adjacent. To avoid serialization, we
>>> ftruncate() the file to a larger size, then ftruncate() it back when
>>> we know its final size.
>>
>> So it's not extending the file on the writes, so it shouldn't be
>> triggering EOF block zeroing. The only thing I can think of is
>> either the data contains zeros or there's an occasional unwritten
>> extent in the file.
>
>
> The data is compressed, so it can't contain zeros originally. Of course it's
> possible the application zeroed that page after preparing the buffer and
> before the write hit the disk, but that's fairly unlikely. Zeroing pages is
> a kernel thing; even if the application allocated 4k of memory (not very
> common, but it does happen), it wouldn't zero it; and that buffer of course
> is held during the write.
>
> We're adding code to check the buffer before and after the write, and also
> read back from disk.
>
>>
>>> Does this trigger anything in anyone's mind?
>>
>> Nope - do you have a reproducer you can share?
>>
>
> Run a certain NoSQL database for months on a cluster with lots of activity,
> and you _may_ see it a few times. It's very rare, but it's there.
>


* Re: Intermittent zeroed pages with AIO+DIO+XFS
  2017-08-04  2:40   ` Avi Kivity
  2017-08-04  2:50     ` Glauber Costa
@ 2017-08-04  3:14     ` Dave Chinner
  2017-08-04  3:36       ` Avi Kivity
  1 sibling, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2017-08-04  3:14 UTC
  To: Avi Kivity; +Cc: linux-xfs, Glauber Costa, Raphael Carvalho

On Fri, Aug 04, 2017 at 05:40:07AM +0300, Avi Kivity wrote:
> On 08/04/2017 01:09 AM, Dave Chinner wrote:
> >On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
> >>Hello,
> >>
> >Hi Avi,
> >
> >>I have an application that uses AIO+DIO to write data to a file on
> >>XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
> >>blocks within the file that are zeroed. The blocks are not aligned
> >>to a 128k boundary, just 4k. The buffers are allocated in anonymous
> >>memory, which is usually using transparent hugepages.  The files are
> >>fully allocated, not sparse (checked post-mortem).
> >Did you check that the extents are written? i.e. there aren't
> >sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)
> 
> Raphael did that, and the result was that the file was NOT sparse.

Sure, but a file with unwritten extents is not sparse. It's just got
extents that will always read as zeros. The extra "-vvp" output
tells you the unwritten flag state and does not merge contiguous
extents that differ only in state.

e.g.:

$ sudo xfs_io -fd -c "falloc 0 1M" -c "pwrite 900k 200k" /mnt/scratch/foo
wrote 204800/204800 bytes at offset 921600
200 KiB, 50 ops; 0.0000 sec (13.838 MiB/sec and 3542.5818 ops/sec)
$ sudo xfs_bmap /mnt/scratch/foo
/mnt/scratch/foo:
        0: [0..2199]: 160..2359

Looks fully allocated. However:

$ sudo xfs_bmap -vvp /mnt/scratch/foo
/mnt/scratch/foo:
 EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
   0: [0..1799]:       160..1959          0 (160..1959)       1800 010000
   1: [1800..2199]:    1960..2359         0 (1960..2359)       400 000000
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end   on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end   on stripe width
$

The first 900k of the file is an unwritten extent, which returns
zeros...
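
For files with many extents, the same check can be scripted. The sketch below assumes `xfs_bmap -vvp` output shaped like the sample above, with every data row ending in an octal FLAGS field and ranges given in 512-byte blocks; `unwritten_ranges` is a hypothetical helper, not part of xfsprogs.

```python
import re

UNWRITTEN = 0o10000   # "Unwritten preallocated extent" in the legend above
SECTOR = 512          # xfs_bmap reports ranges in 512-byte blocks

# Matches rows such as "0: [0..1799]:  160..1959  0 (160..1959)  1800 010000"
ROW = re.compile(r"^\s*\d+:\s+\[(\d+)\.\.(\d+)\]:.*\s([0-7]+)\s*$")

def unwritten_ranges(bmap_output):
    """Return (start_byte, length_bytes) for every extent flagged unwritten."""
    ranges = []
    for line in bmap_output.splitlines():
        if "hole" in line:
            continue                      # holes carry no flags field
        m = ROW.match(line)
        if not m:
            continue                      # header, legend, or file name line
        first, last = int(m.group(1)), int(m.group(2))
        flags = int(m.group(3), 8)        # FLAGS is read as an octal bitmask
        if flags & UNWRITTEN:
            ranges.append((first * SECTOR, (last - first + 1) * SECTOR))
    return ranges
```

On the sample output this reports a single unwritten range starting at byte 0 and covering 1800 * 512 = 921600 bytes, i.e. the first 900k of the file.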

> btw, we also run with the extent size hint set to 32MB.

Which means that space is definitely being allocated as unwritten
extents, then overwritten and converted on IO completion. Hence if
the overwrite is not complete, or there's a bug in the unwritten
extent conversion, it may leave unwritten extents where it
shouldn't....
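
A toy model of that failure mode, purely for illustration. It is not XFS code: `ToyFile`, its per-block `written` state and the `lost` parameter are hypothetical and only mimic "allocate as unwritten, convert on IO completion".

```python
BLOCK = 4096

class ToyFile:
    """Toy model: blocks start unwritten and are converted when a write completes."""

    def __init__(self, nblocks):
        self.disk = [bytes(BLOCK)] * nblocks   # what the media holds
        self.written = [False] * nblocks       # per-block extent state

    def dio_write(self, first, data, lost=()):
        """Store data block by block; blocks listed in lost miss their conversion."""
        nblocks = len(data) // BLOCK
        for i in range(nblocks):
            self.disk[first + i] = data[i * BLOCK:(i + 1) * BLOCK]
        for b in range(first, first + nblocks):
            if b not in lost:
                self.written[b] = True

    def read(self, block):
        # Unwritten blocks read back as zeros regardless of what the media holds.
        return self.disk[block] if self.written[block] else bytes(BLOCK)
```

In this model a 128k write whose conversion is lost for one block leaves a single 4k block that reads as zeros even though the data reached the media, which matches the symptom being discussed.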

> >What kernel version is this seen on? We've changed the XFS DIO
> >IO path implementation substantially in recent times....
> 
> CentOS 7.2's kernel. Glauber, do you know the precise version string?

Can you reproduce on an upstream kernel? Problems with highly
patched distro kernels really need to be directed to the distro...

> >>Does this trigger anything in anyone's mind?
> >Nope - do you have a reproducer you can share?
> >
> 
> Run a certain NoSQL database for months on a cluster with lots of
> activity, and you _may_ see it a few times. It's very rare, but it's
> there.

Needle in a haystack, then - the problem could be anywhere in the
storage stack, including hardware. You're going to need to
isolate the problem to the filesystem for us, which means a
reproducer script of some kind...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Intermittent zeroed pages with AIO+DIO+XFS
  2017-08-04  3:14     ` Dave Chinner
@ 2017-08-04  3:36       ` Avi Kivity
  2017-08-04  4:04         ` Raphael S. Carvalho
  0 siblings, 1 reply; 7+ messages in thread
From: Avi Kivity @ 2017-08-04  3:36 UTC
  To: Dave Chinner; +Cc: linux-xfs, Glauber Costa, Raphael Carvalho

On 08/04/2017 06:14 AM, Dave Chinner wrote:
> On Fri, Aug 04, 2017 at 05:40:07AM +0300, Avi Kivity wrote:
>> On 08/04/2017 01:09 AM, Dave Chinner wrote:
>>> On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
>>>> Hello,
>>>>
>>> Hi Avi,
>>>
>>>> I have an application that uses AIO+DIO to write data to a file on
>>>> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
>>>> blocks within the file that are zeroed. The blocks are not aligned
>>>> to a 128k boundary, just 4k. The buffers are allocated in anonymous
>>>> memory, which is usually using transparent hugepages.  The files are
>>>> fully allocated, not sparse (checked post-mortem).
>>> Did you check that the extents are written? i.e. there aren't
>>> sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)
>> Raphael did that, and the result was that the file was NOT sparse.
> Sure, but a file with unwritten extents is not sparse. It's just got
> extents that will always read as zeros. The extra "-vvp" output
> tells you the unwritten flag state and does not merge contiguous
> extents that differ only in state.

Ah, thanks for the explanation. Raphael, can you check this?

> e.g.:
>
> $ sudo xfs_io -fd -c "falloc 0 1M" -c "pwrite 900k 200k" /mnt/scratch/foo
> wrote 204800/204800 bytes at offset 921600
> 200 KiB, 50 ops; 0.0000 sec (13.838 MiB/sec and 3542.5818 ops/sec)
> $ sudo xfs_bmap /mnt/scratch/foo
> /mnt/scratch/foo:
>          0: [0..2199]: 160..2359
>
> Looks fully allocated. However:
>
> $ sudo xfs_bmap -vvp /mnt/scratch/foo
> /mnt/scratch/foo:
>   EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
>     0: [0..1799]:       160..1959          0 (160..1959)       1800 010000
>     1: [1800..2199]:    1960..2359         0 (1960..2359)       400 000000
>   FLAG Values:
>      0100000 Shared extent
>      0010000 Unwritten preallocated extent
>      0001000 Doesn't begin on stripe unit
>      0000100 Doesn't end   on stripe unit
>      0000010 Doesn't begin on stripe width
>      0000001 Doesn't end   on stripe width
> $
>
> The first 900k of the file is an unwritten extent, which returns
> zeros...
>
>> btw, we also run with the extent size hint set to 32MB.
> Which means that space is definitely being allocated as unwritten
> extents, then overwritten and converted on IO completion. Hence if
> the overwrite is not complete, or there's a bug in the unwritten
> extent conversion, it may leave unwritten extents where it
> shouldn't....
>
>>> What kernel version is this seen on? We've changed the XFS DIO
>>> IO path implementation substantially in recent times....
>> CentOS 7.2's kernel. Glauber, do you know the precise version string?
> Can you reproduce on an upstream kernel? Problems with highly
> patched distro kernels really need to be directed to the distro...

This is a production cluster, and we've only seen the problem in this 
one cluster, and _very_ rarely there.

>>>> Does this trigger anything in anyone's mind?
>>> Nope - do you have a reproducer you can share?
>>>
>> Run a certain NoSQL database for months on a cluster with lots of
>> activity, and you _may_ see it a few times. It's very rare, but it's
>> there.
> Needle in a haystack, then - the problem could be anywhere in the
> storage stack, including hardware.

Yes, unfortunately.

>   You're going to need to
> isolate the problem to the filesystem for us, which means a
> reproducer script of some kind...

It's very unlikely we'll find a simple reproducer; this email was more 
to see if the list has seen this problem before rather than as a 
detailed bug report.


* Re: Intermittent zeroed pages with AIO+DIO+XFS
  2017-08-04  3:36       ` Avi Kivity
@ 2017-08-04  4:04         ` Raphael S. Carvalho
  0 siblings, 0 replies; 7+ messages in thread
From: Raphael S. Carvalho @ 2017-08-04  4:04 UTC
  Cc: linux-xfs

On Fri, Aug 4, 2017 at 12:36 AM, Avi Kivity <avi@scylladb.com> wrote:
> On 08/04/2017 06:14 AM, Dave Chinner wrote:
>>
>> On Fri, Aug 04, 2017 at 05:40:07AM +0300, Avi Kivity wrote:
>>>
>>> On 08/04/2017 01:09 AM, Dave Chinner wrote:
>>>>
>>>> On Thu, Aug 03, 2017 at 05:52:45PM +0300, Avi Kivity wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>> Hi Avi,
>>>>
>>>>> I have an application that uses AIO+DIO to write data to a file on
>>>>> XFS. The writes use 128k buffers. Very rarely, I see aligned 4k
>>>>> blocks within the file that are zeroed. The blocks are not aligned
>>>>> to a 128k boundary, just 4k. The buffers are allocated in anonymous
>>>>> memory, which is usually using transparent hugepages.  The files are
>>>>> fully allocated, not sparse (checked post-mortem).
>>>>
>>>> Did you check that the extents are written? i.e. there aren't
>>>> sporadic 4k unwritten extents in the file? (xfs_bmap -vvp output)
>>>
>>> Raphael did that, and the result was that the file was NOT sparse.
>>
>> Sure, but a file with unwritten extents is not sparse. It's just got
>> extents that will always read as zeros. The extra "-vvp" output
>> tells you the unwritten flag state and does not merge contiguous
>> extents that differ only in state.
>
>
> Ah, thanks for the explanation. Raphael, can you check this?

Hi, everyone.

All extents have the flag 01111, which, if I understand correctly, means
none of them are unwritten.
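
That reading can be checked by decoding the field against the legend Dave quoted. The sketch below assumes the FLAGS column is an octal bitmask whose bits match that legend; `decode_flags` and `FLAG_NAMES` are hypothetical names.

```python
# Legend copied from the `xfs_bmap -vvp` output quoted in this thread.
FLAG_NAMES = [
    (0o100000, "shared extent"),
    (0o010000, "unwritten preallocated extent"),
    (0o001000, "doesn't begin on stripe unit"),
    (0o000100, "doesn't end on stripe unit"),
    (0o000010, "doesn't begin on stripe width"),
    (0o000001, "doesn't end on stripe width"),
]

def decode_flags(field):
    """Return the legend entries set in an octal FLAGS field such as '01111'."""
    value = int(field, 8)
    return [name for bit, name in FLAG_NAMES if value & bit]
```

Under that assumption `decode_flags("01111")` yields only the four stripe unit and stripe width entries, so neither the unwritten bit nor the shared bit is set on those extents.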

I was curious if there's any chance there's still an unknown bug which
is somewhat related to this one:
http://oss.sgi.com/archives/xfs/2015-04/msg00159.html. We no longer
submit size-changing ops in parallel though; they're now serialized. I
checked that the kernel of the system that reproduced this issue contains
the aforementioned fix.

>
>
>> e.g.:
>>
>> $ sudo xfs_io -fd -c "falloc 0 1M" -c "pwrite 900k 200k" /mnt/scratch/foo
>> wrote 204800/204800 bytes at offset 921600
>> 200 KiB, 50 ops; 0.0000 sec (13.838 MiB/sec and 3542.5818 ops/sec)
>> $ sudo xfs_bmap /mnt/scratch/foo
>> /mnt/scratch/foo:
>>          0: [0..2199]: 160..2359
>>
>> Looks fully allocated. However:
>>
>> $ sudo xfs_bmap -vvp /mnt/scratch/foo
>> /mnt/scratch/foo:
>>   EXT: FILE-OFFSET      BLOCK-RANGE       AG AG-OFFSET        TOTAL FLAGS
>>     0: [0..1799]:       160..1959          0 (160..1959)       1800 010000
>>     1: [1800..2199]:    1960..2359         0 (1960..2359)       400 000000
>>   FLAG Values:
>>      0100000 Shared extent
>>      0010000 Unwritten preallocated extent
>>      0001000 Doesn't begin on stripe unit
>>      0000100 Doesn't end   on stripe unit
>>      0000010 Doesn't begin on stripe width
>>      0000001 Doesn't end   on stripe width
>> $
>>
>> The first 900k of the file is an unwritten extent, which returns
>> zeros...
>>
>>> btw, we also run with the extent size hint set to 32MB.
>>
>> Which means that space is definitely being allocated as unwritten
>> extents, then overwritten and converted on IO completion. Hence if
>> the overwrite is not complete, or there's a bug in the unwritten
>> extent conversion, it may leave unwritten extents where it
>> shouldn't....
>>
>>>> What kernel version is this seen on? We've changed the XFS DIO
>>>> IO path implementation substantially in recent times....
>>>
>>> CentOS 7.2's kernel. Glauber, do you know the precise version string?
>>
>> Can you reproduce on an upstream kernel? Problems with highly
>> patched distro kernels really need to be directed to the distro...
>
>
> This is a production cluster, and we've only seen the problem in this one
> cluster, and _very_ rarely there.
>
>>>>> Does this trigger anything in anyone's mind?
>>>>
>>>> Nope - do you have a reproducer you can share?
>>>>
>>> Run a certain NoSQL database for months on a cluster with lots of
>>> activity, and you _may_ see it a few times. It's very rare, but it's
>>> there.
>>
>> Needle in a haystack, then - the problem could be anywhere in the
>> storage stack, including hardware.
>
>
> Yes, unfortunately.
>
>>   You're going to need to
>> isolate the problem to the filesystem for us, which means a
>> reproducer script of some kind...
>
>
> It's very unlikely we'll find a simple reproducer; this email was more to
> see if the list has seen this problem before rather than as a detailed bug
> report.

