All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kevin Cernekee <cernekee@gmail.com>
To: Ralf Baechle <ralf@linux-mips.org>
Cc: <dediao@cisco.com>, <dvomlehn@cisco.com>,
	<linux-mips@linux-mips.org>, <linux-kernel@vger.kernel.org>
Subject: [PATCH 0/5] MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
Date: Mon, 6 Sep 2010 21:03:42 -0700	[thread overview]
Message-ID: <ea0143cf318153e615b660cb0210f464@localhost> (raw)

I had some time this weekend to revisit an old problem:

http://www.linux-mips.org/archives/linux-mips/2010-05/msg00009.html


PROBLEM #1:

When testing Dezhong's patch, I found that un-kmapped pages could get
passed into dma_map_sg().  The original patch did not handle this case.
This resulted in serious DMA coherency issues on my system, along the
lines of "/sbin/init segfaults on boot."

I found that this calling sequence, seen during SATA disk I/O, was the
culprit:

[<8002b3d4>] dma_map_sg+0x2ac/0x2b8                                             
[<802cdb70>] ata_qc_issue+0x274/0x37c                                           
[<802d383c>] ata_scsi_translate+0xb0/0x1e8                                      
[<802d7274>] ata_scsi_queuecmd+0xc8/0x2b8                                       
[<802b0630>] scsi_dispatch_cmd+0x118/0x2b8                                      
[<802b6c64>] scsi_request_fn+0x420/0x4d8                                        
[<80239550>] __blk_run_queue+0x84/0x19c                                         
[<802364a0>] elv_insert+0x158/0x2e4                                             
[<8023a398>] __make_request+0x11c/0x4d0                                         
[<80238144>] generic_make_request+0x348/0x4a0                                   
[<80238330>] submit_bio+0x94/0x13c                                              
[<80114024>] mpage_bio_submit+0x30/0x40                                         
[<80115a2c>] mpage_readpages+0x134/0x170                                        
[<800a4fd4>] __do_page_cache_readahead+0x214/0x314                              

__do_page_cache_readahead() allocates a bunch of pages for the I/O
requests, but I do not see them getting mapped to kernel addresses via
kmap() / kmap_atomic().  This explains why (PageHighMem(page) &&
!kmap_high_get(page)) can be true.

I do not think there is anything wrong with this behavior, so it seems
like dma_map_sg() will need to handle it somehow.  Looking at other
architectures:

The PPC approach (__dma_sync_page_highmem()) is to disable IRQs and call
kmap_atomic() to create a temporary mapping for the page.  Disabling
IRQs is necessary because it is possible (but not required) for the
dma_* functions to be invoked from interrupt context; it also disables
preemption.  kmap_atomic() guarantees a unique mapping per CPU.

Interestingly, PPC does not use kmap_high_get() at all.

The ARM approach (kmap_high_l1_vipt()) recognizes that it is not always
desirable to keep IRQs disabled during DMA flushes, so the ARM
maintainers implemented a sort of "reentrant kmap_atomic()" that allows
multiple contexts to share the same pte by saving and restoring whatever
was there prior to the DMA sync.

Due to its use of kmap_high_get(), the ARM approach suffers from problem
#2, below.  It is also more complex and harder to test.  I'm not sure
how to recreate some of the worst corner cases, e.g. hardirq cacheflush
interrupts a softirq cacheflush which interrupted a user cacheflush.

Dezhong's patch was largely based on the ARM scheme, minus the
kmap_high_l1_vipt() logic.  I am sending an update to this patch which
attempts to imitate the PPC approach instead.  There is a considerable
amount of reuse since both strategies require similar modifications to
dma-default.c, in order to get that code to pass around "struct page"
pointers rather than kseg0 addresses.


PROBLEM #2:

Consider this sequence:

mm/highmem.c:kmap_high()
mm/highmem.c:map_new_virtual()
mm/highmem.c:flush_all_zero_pkmaps()
arch/mips/kernel/smp.c:flush_tlb_kernel_range()
kernel/softirq.c:on_each_cpu()
kernel/smp.c:smp_call_function()

The first thing kmap_high() does is lock_kmap(), which disables
interrupts on kmap_high_get() architectures:

#ifdef ARCH_NEEDS_KMAP_HIGH_GET
#define lock_kmap()             spin_lock_irq(&kmap_lock)
...
#else
#define lock_kmap()             spin_lock(&kmap_lock)
#endif

smp_call_function() may not be called with interrupts disabled:

 * You must not call this function with disabled interrupts or from a
 * hardware interrupt handler or from a bottom half handler.

So, on SMP, we get warnings like:

------------[ cut here ]------------                                            
WARNING: at kernel/smp.c:293 smp_call_function_single+0x17c/0x260()             
Modules linked in:
Call Trace:                                                                     
[<800160f0>] dump_stack+0x8/0x34                                                
[<8004a544>] warn_slowpath_common+0x78/0xa4                                     
[<8004a588>] warn_slowpath_null+0x18/0x24                                       
[<800876e4>] smp_call_function_single+0x17c/0x260                               
[<80087ca8>] smp_call_function+0x28/0x38                                        
[<800531b4>] on_each_cpu+0x1c/0x80                                              
[<80026d20>] flush_tlb_kernel_range+0x28/0x34                                   
[<800b95f4>] kmap_high+0x1dc/0x270                                              
[<8002a570>] __kmap+0x60/0x7c                                                   
[<801f15b4>] do_readpage+0x68/0x540                                             
[<801f218c>] ubifs_write_begin+0xbc/0x53c                                       
[<8009a558>] generic_perform_write+0xd4/0x1f0                                   
[<8009a6e4>] generic_file_buffered_write+0x70/0xbc                              
[<8009dd3c>] __generic_file_aio_write+0x2fc/0x600                               
[<8009e0b0>] generic_file_aio_write+0x70/0xf4                                   
[<800da864>] do_sync_write+0xc4/0x13c                                           
[<800db50c>] vfs_write+0xc0/0x168                                               
[<800db6ac>] sys_write+0x4c/0xa4                                                
[<80003d9c>] stack_done+0x20/0x3c                                               
                                                                                
---[ end trace 8abde6adefbcc81f ]---

I did some digging and found that ARM runs into the same problem.  For
processors that cannot "broadcast" TLB operations, SMP + HIGHMEM are
deemed incompatible:

http://www.spinics.net/lists/arm-kernel/msg74563.html

So, I would opt for the PPC approach in order to avoid this conflict.
Disabling SMP will result in a much nastier performance problem than
blocking interrupts during flushes.


PROBLEM #3:

Regarding David's flush_data_cache_page() concern:

http://www.linux-mips.org/archives/linux-mips/2008-03/msg00011.html


There are at least 4 reasons to flush the cache:

1) Boot time (cache contents are undefined at reset)

2) DMA coherence

3) I$/D$ coherence (self-modifying code)

4) Zap cache aliases


My interpretation of the code is that flush_dcache_page() is only called
for #3 and #4:

#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
static inline void flush_dcache_page(struct page *page)
{               
        if (cpu_has_dc_aliases || !cpu_has_ic_fills_f_dc)
                __flush_dcache_page(page);
                
}       


The fact that the L2 / board cache is not flushed in c-r4k.c leads me to
believe that #2 is not an intended use of this function:

static inline void local_r4k_flush_data_cache_page(void * addr)
{
        r4k_blast_dcache_page((unsigned long) addr);
}

Besides, we already have a standard, documented DMA API that should be
used instead.

Since this is not boot time, and HIGHMEM is incompatible with cache
aliases, that leaves #3 as a possible issue.  Is this function ever used
for I$/D$ coherence?

I should note that ARM has special handling in __flush_dcache_page() for
high pages.  IIRC, at one point they tried to support HIGHMEM on systems
with cache aliases, before deciding it was not feasible.  Maybe this
code is just a relic - or maybe it's really needed for something.

WARNING: multiple messages have this Message-ID (diff)
From: Kevin Cernekee <cernekee@gmail.com>
To: Ralf Baechle <ralf@linux-mips.org>
Cc: dediao@cisco.com, dvomlehn@cisco.com, linux-mips@linux-mips.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 0/5] MIPS: HIGHMEM DMA on noncoherent MIPS32 processors
Date: Mon, 6 Sep 2010 21:03:42 -0700	[thread overview]
Message-ID: <ea0143cf318153e615b660cb0210f464@localhost> (raw)
Message-ID: <20100907040342.5QTlPxEkVPAnX7D4aBtZGFVzF_hrdmnJaAvnwOZ9bbg@z> (raw)

I had some time this weekend to revisit an old problem:

http://www.linux-mips.org/archives/linux-mips/2010-05/msg00009.html


PROBLEM #1:

When testing Dezhong's patch, I found that un-kmapped pages could get
passed into dma_map_sg().  The original patch did not handle this case.
This resulted in serious DMA coherency issues on my system, along the
lines of "/sbin/init segfaults on boot."

I found that this calling sequence, seen during SATA disk I/O, was the
culprit:

[<8002b3d4>] dma_map_sg+0x2ac/0x2b8                                             
[<802cdb70>] ata_qc_issue+0x274/0x37c                                           
[<802d383c>] ata_scsi_translate+0xb0/0x1e8                                      
[<802d7274>] ata_scsi_queuecmd+0xc8/0x2b8                                       
[<802b0630>] scsi_dispatch_cmd+0x118/0x2b8                                      
[<802b6c64>] scsi_request_fn+0x420/0x4d8                                        
[<80239550>] __blk_run_queue+0x84/0x19c                                         
[<802364a0>] elv_insert+0x158/0x2e4                                             
[<8023a398>] __make_request+0x11c/0x4d0                                         
[<80238144>] generic_make_request+0x348/0x4a0                                   
[<80238330>] submit_bio+0x94/0x13c                                              
[<80114024>] mpage_bio_submit+0x30/0x40                                         
[<80115a2c>] mpage_readpages+0x134/0x170                                        
[<800a4fd4>] __do_page_cache_readahead+0x214/0x314                              

__do_page_cache_readahead() allocates a bunch of pages for the I/O
requests, but I do not see them getting mapped to kernel addresses via
kmap() / kmap_atomic().  This explains why (PageHighMem(page) &&
!kmap_high_get(page)) can be true.

I do not think there is anything wrong with this behavior, so it seems
like dma_map_sg() will need to handle it somehow.  Looking at other
architectures:

The PPC approach (__dma_sync_page_highmem()) is to disable IRQs and call
kmap_atomic() to create a temporary mapping for the page.  Disabling
IRQs is necessary because it is possible (but not required) for the
dma_* functions to be invoked from interrupt context; it also disables
preemption.  kmap_atomic() guarantees a unique mapping per CPU.

Interestingly, PPC does not use kmap_high_get() at all.

The ARM approach (kmap_high_l1_vipt()) recognizes that it is not always
desirable to keep IRQs disabled during DMA flushes, so the ARM
maintainers implemented a sort of "reentrant kmap_atomic()" that allows
multiple contexts to share the same pte by saving and restoring whatever
was there prior to the DMA sync.

Due to its use of kmap_high_get(), the ARM approach suffers from problem
#2, below.  It is also more complex and harder to test.  I'm not sure
how to recreate some of the worst corner cases, e.g. hardirq cacheflush
interrupts a softirq cacheflush which interrupted a user cacheflush.

Dezhong's patch was largely based on the ARM scheme, minus the
kmap_high_l1_vipt() logic.  I am sending an update to this patch which
attempts to imitate the PPC approach instead.  There is a considerable
amount of reuse since both strategies require similar modifications to
dma-default.c, in order to get that code to pass around "struct page"
pointers rather than kseg0 addresses.


PROBLEM #2:

Consider this sequence:

mm/highmem.c:kmap_high()
mm/highmem.c:map_new_virtual()
mm/highmem.c:flush_all_zero_pkmaps()
arch/mips/kernel/smp.c:flush_tlb_kernel_range()
kernel/softirq.c:on_each_cpu()
kernel/smp.c:smp_call_function()

The first thing kmap_high() does is lock_kmap(), which disables
interrupts on kmap_high_get() architectures:

#ifdef ARCH_NEEDS_KMAP_HIGH_GET
#define lock_kmap()             spin_lock_irq(&kmap_lock)
...
#else
#define lock_kmap()             spin_lock(&kmap_lock)
#endif

smp_call_function() may not be called with interrupts disabled:

 * You must not call this function with disabled interrupts or from a
 * hardware interrupt handler or from a bottom half handler.

So, on SMP, we get warnings like:

------------[ cut here ]------------                                            
WARNING: at kernel/smp.c:293 smp_call_function_single+0x17c/0x260()             
Modules linked in:
Call Trace:                                                                     
[<800160f0>] dump_stack+0x8/0x34                                                
[<8004a544>] warn_slowpath_common+0x78/0xa4                                     
[<8004a588>] warn_slowpath_null+0x18/0x24                                       
[<800876e4>] smp_call_function_single+0x17c/0x260                               
[<80087ca8>] smp_call_function+0x28/0x38                                        
[<800531b4>] on_each_cpu+0x1c/0x80                                              
[<80026d20>] flush_tlb_kernel_range+0x28/0x34                                   
[<800b95f4>] kmap_high+0x1dc/0x270                                              
[<8002a570>] __kmap+0x60/0x7c                                                   
[<801f15b4>] do_readpage+0x68/0x540                                             
[<801f218c>] ubifs_write_begin+0xbc/0x53c                                       
[<8009a558>] generic_perform_write+0xd4/0x1f0                                   
[<8009a6e4>] generic_file_buffered_write+0x70/0xbc                              
[<8009dd3c>] __generic_file_aio_write+0x2fc/0x600                               
[<8009e0b0>] generic_file_aio_write+0x70/0xf4                                   
[<800da864>] do_sync_write+0xc4/0x13c                                           
[<800db50c>] vfs_write+0xc0/0x168                                               
[<800db6ac>] sys_write+0x4c/0xa4                                                
[<80003d9c>] stack_done+0x20/0x3c                                               
                                                                                
---[ end trace 8abde6adefbcc81f ]---

I did some digging and found that ARM runs into the same problem.  For
processors that cannot "broadcast" TLB operations, SMP + HIGHMEM are
deemed incompatible:

http://www.spinics.net/lists/arm-kernel/msg74563.html

So, I would opt for the PPC approach in order to avoid this conflict.
Disabling SMP will result in a much nastier performance problem than
blocking interrupts during flushes.


PROBLEM #3:

Regarding David's flush_data_cache_page() concern:

http://www.linux-mips.org/archives/linux-mips/2008-03/msg00011.html


There are at least 4 reasons to flush the cache:

1) Boot time (cache contents are undefined at reset)

2) DMA coherence

3) I$/D$ coherence (self-modifying code)

4) Zap cache aliases


My interpretation of the code is that flush_dcache_page() is only called
for #3 and #4:

#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
static inline void flush_dcache_page(struct page *page)
{               
        if (cpu_has_dc_aliases || !cpu_has_ic_fills_f_dc)
                __flush_dcache_page(page);
                
}       


The fact that the L2 / board cache is not flushed in c-r4k.c leads me to
believe that #2 is not an intended use of this function:

static inline void local_r4k_flush_data_cache_page(void * addr)
{
        r4k_blast_dcache_page((unsigned long) addr);
}

Besides, we already have a standard, documented DMA API that should be
used instead.

Since this is not boot time, and HIGHMEM is incompatible with cache
aliases, that leaves #3 as a possible issue.  Is this function ever used
for I$/D$ coherence?

I should note that ARM has special handling in __flush_dcache_page() for
high pages.  IIRC, at one point they tried to support HIGHMEM on systems
with cache aliases, before deciding it was not feasible.  Maybe this
code is just a relic - or maybe it's really needed for something.

             reply	other threads:[~2010-09-07  4:14 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-09-07  4:03 Kevin Cernekee [this message]
2010-09-07  4:03 ` [PATCH 0/5] MIPS: HIGHMEM DMA on noncoherent MIPS32 processors Kevin Cernekee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ea0143cf318153e615b660cb0210f464@localhost \
    --to=cernekee@gmail.com \
    --cc=dediao@cisco.com \
    --cc=dvomlehn@cisco.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@linux-mips.org \
    --cc=ralf@linux-mips.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.