* [PATCH] firewire: net: fix panic in fwnet_write_complete [not found] ` <4B547154.40700@s5r6.in-berlin.de> @ 2010-01-18 21:36 ` Stefan Richter 2010-01-19 0:42 ` Stefan Richter 0 siblings, 1 reply; 5+ messages in thread From: Stefan Richter @ 2010-01-18 21:36 UTC (permalink / raw) To: Илья Басин Cc: linux1394-devel, linux-kernel Date: From: Stefan Richter <stefanr@s5r6.in-berlin.de> Subject: firewire: net: fix panic in fwnet_write_complete In the transmit path of firewire-net (IPv4 over 1394), the following race condition may occur: - The networking soft IRQ inserts a datagram into the 1394 async request transmit DMA. - The 1394 async transmit completion tasklet runs to finish cleaning up (unlink datagram from list of pending ones, release skb and outbound 1394 transaction object) --- before the networking soft IRQ had a chance to proceed and add the datagram to the list of pending datagrams. This caused a panic in the 1394 async transmit completion tasklet when it dereferenced unitialized list heads: http://bugzilla.kernel.org/show_bug.cgi?id=15077 The fix is to add checks in the tx soft RQ and in the tasklet to determine who of these two is the last referrer to the transaction object. Then handle the cleanup of the object by the last referrer rather than assuming that the tasklet is always the last one. There is another similar race: Between said tasklet and fwnet_close, i.e. at ifdown. However, that race is much less likely to occur in practice and shall be fixed in a separate update. Reported-by: Илья Басин <basinilya@gmail.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> --- Илья, could you give this a try? drivers/firewire/net.c | 53 ++++++++++++++++++++++++++++++----------- 1 file changed, 39 insertions(+), 14 deletions(-) Index: linux-2.6.32.2/drivers/firewire/net.c =================================================================== --- linux-2.6.32.2.orig/drivers/firewire/net.c +++ linux-2.6.32.2/drivers/firewire/net.c @@ -893,20 +893,31 @@ static void fwnet_receive_broadcast(stru static struct kmem_cache *fwnet_packet_task_cache; +static void fwnet_free_ptask(struct fwnet_packet_task *ptask) +{ + dev_kfree_skb_any(ptask->skb); + kmem_cache_free(fwnet_packet_task_cache, ptask); +} + static int fwnet_send_packet(struct fwnet_packet_task *ptask); static void fwnet_transmit_packet_done(struct fwnet_packet_task *ptask) { - struct fwnet_device *dev; + struct fwnet_device *dev = ptask->dev; unsigned long flags; - - dev = ptask->dev; + bool free; spin_lock_irqsave(&dev->lock, flags); - list_del(&ptask->pt_link); - spin_unlock_irqrestore(&dev->lock, flags); - ptask->outstanding_pkts--; /* FIXME access inside lock */ + ptask->outstanding_pkts--; + + /* Check whether we or the networking TX soft-IRQ is last user. */ + free = (ptask->outstanding_pkts == 0 && !list_empty(&ptask->pt_link)); + + if (ptask->outstanding_pkts == 0) + list_del(&ptask->pt_link); + + spin_unlock_irqrestore(&dev->lock, flags); if (ptask->outstanding_pkts > 0) { u16 dg_size; @@ -951,10 +962,10 @@ static void fwnet_transmit_packet_done(s ptask->max_payload = skb->len + RFC2374_FRAG_HDR_SIZE; } fwnet_send_packet(ptask); - } else { - dev_kfree_skb_any(ptask->skb); - kmem_cache_free(fwnet_packet_task_cache, ptask); } + + if (free) + fwnet_free_ptask(ptask); } static void fwnet_write_complete(struct fw_card *card, int rcode, @@ -977,6 +988,7 @@ static int fwnet_send_packet(struct fwne unsigned tx_len; struct rfc2734_header *bufhdr; unsigned long flags; + bool free; dev = ptask->dev; tx_len = ptask->max_payload; @@ -1022,12 +1034,16 @@ static int fwnet_send_packet(struct fwne generation, SCODE_100, 0ULL, ptask->skb->data, tx_len + 8, fwnet_write_complete, ptask); - /* FIXME race? */ spin_lock_irqsave(&dev->lock, flags); - list_add_tail(&ptask->pt_link, &dev->broadcasted_list); + + /* If the AT tasklet already ran, we may be last user. */ + free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link)); + if (!free) + list_add_tail(&ptask->pt_link, &dev->broadcasted_list); + spin_unlock_irqrestore(&dev->lock, flags); - return 0; + goto out; } fw_send_request(dev->card, &ptask->transaction, @@ -1035,12 +1051,19 @@ static int fwnet_send_packet(struct fwne ptask->generation, ptask->speed, ptask->fifo_addr, ptask->skb->data, tx_len, fwnet_write_complete, ptask); - /* FIXME race? */ spin_lock_irqsave(&dev->lock, flags); - list_add_tail(&ptask->pt_link, &dev->sent_list); + + /* If the AT tasklet already ran, we may be last user. */ + free = (ptask->outstanding_pkts == 0 && list_empty(&ptask->pt_link)); + if (!free) + list_add_tail(&ptask->pt_link, &dev->sent_list); + spin_unlock_irqrestore(&dev->lock, flags); dev->netdev->trans_start = jiffies; + out: + if (free) + fwnet_free_ptask(ptask); return 0; } @@ -1298,6 +1321,8 @@ static netdev_tx_t fwnet_tx(struct sk_bu spin_unlock_irqrestore(&dev->lock, flags); ptask->max_payload = max_payload; + INIT_LIST_HEAD(&ptask->pt_link); + fwnet_send_packet(ptask); return NETDEV_TX_OK; -- Stefan Richter -=====-==-=- ---= =--=- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] firewire: net: fix panic in fwnet_write_complete 2010-01-18 21:36 ` [PATCH] firewire: net: fix panic in fwnet_write_complete Stefan Richter @ 2010-01-19 0:42 ` Stefan Richter 2010-01-26 20:09 ` Stefan Richter 0 siblings, 1 reply; 5+ messages in thread From: Stefan Richter @ 2010-01-19 0:42 UTC (permalink / raw) To: Илья Басин Cc: linux1394-devel, linux-kernel Stefan Richter wrote: > The fix is to add checks in the tx soft RQ and in the tasklet to > determine who of these two is the last referrer to the transaction > object. Then handle the cleanup of the object by the last referrer > rather than assuming that the tasklet is always the last one. ... > Илья, could you give this a try? My own testing on a dual core box --- peered with another Linux box which ran the older eth1394 --- worked OK so far for transfers of massive files (> 4 GiB) back and forth via FTP and ssh running on a text console. But in my first attempt to use FTP on X11 --- i.e. with more concurrent interrupt sources --- the firewire-net box crashed very soon. In that test I used Dolphin of KDE as FTP client, and the crash already happened after Dolphin had loaded and displayed the remote home directory and was peeking into files for preview data. I got the following trace: ------------: cut here ]------------ kernel: BUG at mm/slab.c:2885! invalid: opcode: 0000 [#1] PREEMPT: SMP: DEBUG_PAGEALLOC: last: sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:03:03.0/fw0/units Modules: linked in: firewire_net: firewire_ohci: firewire_core: netconsole: nfs: lockd: sunrpc: i915: drm_kms_helper: drm: i2c_algo_bit: snd_hda_codec_idt: snd_hda_intel: snd_hda_codec: applesmc: led_class: input_polldev: snd_pcm: rtc: i2c_i801: snd_timer: sg: sky2: snd: video: backlight: snd_page_alloc: thermal: output: button: Pid: 4267, comm: kio_thumbnail Not tainted 2.6.33-rc4 #2 Mac-F4208EC8/Macmini1,1 EIP: 0060:[<c1080e4f>] EFLAGS: 00010006 CPU: 0 EIP: is at cache_free_debugcheck+0x1e8/0x2e8 EAX: f8efddca EBX: f2008000 ECX: 65727661 EDX: 00c38e0a ESI: d84156c5 EDI: f707fe40 EBP: f07fddc0 ESP: f07fdd80 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process: kio_thumbnail (pid: 4267, ti=f07fc000 task=f079b6e8 task.ti=f07fc000) Stack: f2008e20: 00000000: 0000000c: c11c6f96: 00640100: f2008000: 635688c0: d84156c5: kernel: f2008ec0: 00000082: f2008e18: 00000000: 00000000: f715ff30: f707fe40: f2008e20: kernel: f07fddd8: c108137c: 00000282: f2008e20: f2008e3c: 00000060: f07fdde4: c11c6f96: Call: Trace: ? __kfree_skb+0x6e/0x71 ? kmem_cache_free+0x56/0xb0 ? __kfree_skb+0x6e/0x71 ? kfree_skb+0x2b/0x2d ? unix_stream_recvmsg+0x3c3/0x48d ? file_read_actor+0x74/0xcc ? sock_aio_read+0xf2/0x107 ? do_sync_read+0x89/0xc7 This was all that made it out via netconsole. This was on 2.6.33-rc4 and I think I am going back to 2.6.32.y to eliminate unrelated .33-rc flakiness from my testing. -- Stefan Richter -=====-==-=- ---= =--== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] firewire: net: fix panic in fwnet_write_complete 2010-01-19 0:42 ` Stefan Richter @ 2010-01-26 20:09 ` Stefan Richter 2010-01-31 9:45 ` Re[2]: " Илья Басин 0 siblings, 1 reply; 5+ messages in thread From: Stefan Richter @ 2010-01-26 20:09 UTC (permalink / raw) To: Илья Басин Cc: linux1394-devel, linux-kernel Stefan Richter wrote: > My own testing on a dual core box --- peered with another Linux box > which ran the older eth1394 --- worked OK so far for transfers of > massive files (> 4 GiB) back and forth via FTP and ssh running on a text > console. > > But in my first attempt to use FTP on X11 --- i.e. with more concurrent > interrupt sources --- the firewire-net box crashed very soon. In that > test I used Dolphin of KDE as FTP client, and the crash already happened > after Dolphin had loaded and displayed the remote home directory and was > peeking into files for preview data. I got the following trace: > > ------------: cut here ]------------ > kernel: BUG at mm/slab.c:2885! [...] > EIP: is at cache_free_debugcheck+0x1e8/0x2e8 [...] > Call: Trace: > ? __kfree_skb+0x6e/0x71 > ? kmem_cache_free+0x56/0xb0 > ? __kfree_skb+0x6e/0x71 > ? kfree_skb+0x2b/0x2d > ? unix_stream_recvmsg+0x3c3/0x48d > ? file_read_actor+0x74/0xcc > ? sock_aio_read+0xf2/0x107 > ? do_sync_read+0x89/0xc7 Hi Илья, I am going to send a pull request for some other unrelated firewire fixes to Linus about tomorrow. firewire-net on the other hand needs still more work than my fwnet_write_complete patch since you and I now get these kmem cache corruption related bugs. What is your impression --- does this first incomplete fix decrease the likelihood of crashes enough to make it worth to include it in a pull request already? I haven't done more extensive firewire-net tests since last week yet, hence it is hard to tell for me how severe the new issue is in practical use. (Also, I have no idea yet whether I will be quick or slow to find this other problem and whether it can be fixed in a manner that is suitable for a mainline merge before 2.6.33 is going to be released.) -- Stefan Richter -=====-==-=- ---= ==-=- http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re[2]: [PATCH] firewire: net: fix panic in fwnet_write_complete 2010-01-26 20:09 ` Stefan Richter @ 2010-01-31 9:45 ` Илья Басин 2010-01-31 10:31 ` Stefan Richter 0 siblings, 1 reply; 5+ messages in thread From: Илья Басин @ 2010-01-31 9:45 UTC (permalink / raw) To: Stefan Richter; +Cc: linux1394-devel, linux-kernel SR> Stefan Richter wrote: >> My own testing on a dual core box --- peered with another Linux box >> which ran the older eth1394 --- worked OK so far for transfers of >> massive files (> 4 GiB) back and forth via FTP and ssh running on a text >> console. >> >> But in my first attempt to use FTP on X11 --- i.e. with more concurrent >> interrupt sources --- the firewire-net box crashed very soon. In that >> test I used Dolphin of KDE as FTP client, and the crash already happened >> after Dolphin had loaded and displayed the remote home directory and was >> peeking into files for preview data. I got the following trace: >> >> ------------: cut here ]------------ >> kernel: BUG at mm/slab.c:2885! SR> [...] >> EIP: is at cache_free_debugcheck+0x1e8/0x2e8 SR> [...] >> Call: Trace: >> ? __kfree_skb+0x6e/0x71 >> ? kmem_cache_free+0x56/0xb0 >> ? __kfree_skb+0x6e/0x71 >> ? kfree_skb+0x2b/0x2d >> ? unix_stream_recvmsg+0x3c3/0x48d >> ? file_read_actor+0x74/0xcc >> ? sock_aio_read+0xf2/0x107 >> ? do_sync_read+0x89/0xc7 SR> Hi Илья, SR> I am going to send a pull request for some other unrelated firewire SR> fixes to Linus about tomorrow. SR> firewire-net on the other hand needs still more work than my SR> fwnet_write_complete patch since you and I now get these kmem cache SR> corruption related bugs. SR> What is your impression --- does this first incomplete fix decrease the SR> likelihood of crashes enough to make it worth to include it in a pull SR> request already? I haven't done more extensive firewire-net tests since SR> last week yet, hence it is hard to tell for me how severe the new issue SR> is in practical use. SR> (Also, I have no idea yet whether I will be quick or slow to find this SR> other problem and whether it can be fixed in a manner that is suitable SR> for a mainline merge before 2.6.33 is going to be released.) Hi. Have just found 4 letters from you in gmail spam bin, including this one. Your fix does decrease the chance of crash. Without the fix crashes randomly occured within 3 minutes. With the fix - only after 10 minutes of copying. These 2 bugs are probably unrelated. -- ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] firewire: net: fix panic in fwnet_write_complete 2010-01-31 9:45 ` Re[2]: " Илья Басин @ 2010-01-31 10:31 ` Stefan Richter 0 siblings, 0 replies; 5+ messages in thread From: Stefan Richter @ 2010-01-31 10:31 UTC (permalink / raw) To: Илья Басин Cc: linux1394-devel, linux-kernel Илья Басин wrote: > SR> Stefan Richter wrote: > SR> I am going to send a pull request for some other unrelated firewire > SR> fixes to Linus about tomorrow. [...] > SR> What is your impression --- does this first incomplete fix decrease the > SR> likelihood of crashes enough to make it worth to include it in a pull > SR> request already? I haven't done more extensive firewire-net tests since > SR> last week yet, hence it is hard to tell for me how severe the new issue > SR> is in practical use. [...] > Hi. Have just found 4 letters from you in gmail spam bin, including > this one. (One reason may be that my mail service provider occasionally gets blacklisted in SORBS, perhaps other DNSBLs. Or they distrust kyrillic letters in mails from .de or whatever.) > Your fix does decrease the chance of crash. Without the fix > crashes randomly occured within 3 minutes. With the fix - only after > 10 minutes of copying. These 2 bugs are probably unrelated. OK. The patch didn't make it into 2.6.33-rc6 from last Friday but I will see to it that it gets merged before the final release. I will also try to find time to learn more about the other bug. -- Stefan Richter -=====-==-=- ---= ===== http://arcgraph.de/sr/ ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2010-01-31 10:31 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <574312453.20100114002526@gmail.com>
[not found] ` <4B4E4D73.5060100@s5r6.in-berlin.de>
[not found] ` <4B4E4FD7.9060403@s5r6.in-berlin.de>
[not found] ` <4B53291A.5050901@s5r6.in-berlin.de>
[not found] ` <1833218156.20100118151651@gmail.com>
[not found] ` <4B547154.40700@s5r6.in-berlin.de>
2010-01-18 21:36 ` [PATCH] firewire: net: fix panic in fwnet_write_complete Stefan Richter
2010-01-19 0:42 ` Stefan Richter
2010-01-26 20:09 ` Stefan Richter
2010-01-31 9:45 ` Re[2]: " Илья Басин
2010-01-31 10:31 ` Stefan Richter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox