netfilter.vger.kernel.org archive mirror
* Re: Process Hang in __read_seqcount_begin
       [not found]             ` <1351270087.30380.9.camel@edumazet-glaptop>
@ 2012-10-26 18:51               ` Peter LaDow
  2012-10-26 20:53                 ` Thomas Gleixner
  2012-10-26 21:05                 ` Eric Dumazet
  0 siblings, 2 replies; 7+ messages in thread
From: Peter LaDow @ 2012-10-26 18:51 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-rt-users, netfilter

(I've added netfilter and linux-rt-users to try to pull in more help).

On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Upstream kernel is fine; there is no race, as long as:
>
> local_bh_disable() disables BH and preemption.

Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c, it
doesn't appear that any of the code checks the return value of
xt_write_recseq_begin to determine if it is safe to write.  And neither
does the newly patched code.  How does the mainline code prevent
corruption of the tables it is updating?

Why isn't there something like:

  while ( (addend = xt_write_recseq_begin()) == 0 );

to make sure that only one writer at a time has access to the tables?
Better yet, why not use a seqlock_t instead?
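
(For reference, here is the helper as I read it in
include/linux/netfilter/x_tables.h -- paraphrased from my tree, so it
may differ from yours:

DECLARE_PER_CPU(seqcount_t, xt_recseq);

static inline unsigned int xt_write_recseq_begin(void)
{
	unsigned int addend;

	/* The low bit of sequence is set while a write is in progress,
	 * so addend is 0 if we are nested inside another writer on
	 * this CPU. */
	addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
	__this_cpu_add(xt_recseq.sequence, addend);
	return addend;
}

static inline void xt_write_recseq_end(unsigned int addend)
{
	/* Kind of a write_seqcount_end(), but addend is 0 or 1. */
	smp_wmb();
	__this_cpu_add(xt_recseq.sequence, addend);
}

Nothing ever loops on the return value; it is only handed back to
xt_write_recseq_end().)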

> Apparently RT changes this, so RT needs to change the code.

The RT patch only touches local_bh_disable/enable, not the code in
ip_tables.c.  Does the local_bh_disable/enable in the mainline code
protect against multiple writers?

> cmpxchg() has strong guarantees (and is also slower than upstream code),
> and seems a reasonable way to fix RT/iptables

I see this now.  And I agree that your patch would prevent corruption
of the sequence counter.

Thanks,
Pete


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 18:51               ` Process Hang in __read_seqcount_begin Peter LaDow
@ 2012-10-26 20:53                 ` Thomas Gleixner
  2012-10-26 21:05                 ` Eric Dumazet
  1 sibling, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2012-10-26 20:53 UTC (permalink / raw)
  To: Peter LaDow; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

On Fri, 26 Oct 2012, Peter LaDow wrote:
> (I've added netfilter and linux-rt-users to try to pull in more help).
> 
> On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Upstream kernel is fine; there is no race, as long as:
> >
> > local_bh_disable() disables BH and preemption.
> 
> Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c, it
> doesn't appear that any of the code checks the return value of
> xt_write_recseq_begin to determine if it is safe to write.  And neither
> does the newly patched code.  How does the mainline code prevent
> corruption of the tables it is updating?
> 
> Why isn't there something like:
> 
>   while ( (addend = xt_write_recseq_begin()) == 0 );
> 
> to make sure that only one writer at a time has access to the tables?
> Better yet, why not use a seqlock_t instead?
> 
> > Apparently RT changes this, so RT needs to change the code.
> 
> The RT patch only touches local_bh_disable/enable, not the code in
> ip_tables.c.  Does the local_bh_disable/enable in the mainline code
> protect against multiple writers?
> 
> > cmpxchg() has strong guarantees (and is also slower than upstream code),
> > and seems a reasonable way to fix RT/iptables
> 
> I see this now.  And I agree that your patch would prevent corruption
> of the sequence counter.

Thanks for the reminder. I'll have a look.

       tglx


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 18:51               ` Process Hang in __read_seqcount_begin Peter LaDow
  2012-10-26 20:53                 ` Thomas Gleixner
@ 2012-10-26 21:05                 ` Eric Dumazet
  2012-10-26 21:25                   ` Peter LaDow
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2012-10-26 21:05 UTC (permalink / raw)
  To: Peter LaDow; +Cc: linux-kernel, linux-rt-users, netfilter

On Fri, 2012-10-26 at 11:51 -0700, Peter LaDow wrote:
> (I've added netfilter and linux-rt-users to try to pull in more help).
> 
> On Fri, Oct 26, 2012 at 9:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Upstream kernel is fine; there is no race, as long as:
> >
> > local_bh_disable() disables BH and preemption.
> 
> Looking at the unpatched code in net/ipv4/netfilter/ip_tables.c, it
> doesn't appear that any of the code checks the return value of
> xt_write_recseq_begin to determine if it is safe to write.  And neither
> does the newly patched code.  How does the mainline code prevent
> corruption of the tables it is updating?
> 

Do you know what per-cpu data is in the Linux kernel?

> Why isn't there something like
> 
>   while ( (addend = xt_write_recseq_begin()) == 0 );
> 
> to make sure that only one writer at a time has access to the tables?
> Better yet, why not use a seqlock_t instead?
> 

Because it's not needed.  Really, I don't know why you want that.

Once you are sure a thread cannot be interrupted by a softirq, and
cannot migrate to another CPU, access to per-cpu data doesn't need any
other synchronization at all.

The following sequence is safe:

addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
/*
 * This is kind of a write_seqcount_begin(), but addend is 0 or 1.
 * We don't check the addend value, to avoid a test and conditional jump,
 * since addend is most likely 1.
 */
__this_cpu_add(xt_recseq.sequence, addend);

This is safe because any other thread will use its own per-cpu data.
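
For completeness, the reader side does roughly this (from memory, so
check your tree):

/* inside get_counters(t, counters) */
for_each_possible_cpu(cpu) {
	seqcount_t *s = &per_cpu(xt_recseq, cpu);

	i = 0;
	xt_entry_foreach(iter, t->entries[cpu], t->size) {
		u64 bcnt, pcnt;
		unsigned int start;

		do {
			start = read_seqcount_begin(s);
			bcnt = iter->counters.bcnt;
			pcnt = iter->counters.pcnt;
		} while (read_seqcount_retry(s, start));

		ADD_COUNTER(counters[i], bcnt, pcnt);
		++i;
	}
}

If a writer ever leaves a per-cpu sequence odd, this loop spins
forever in __read_seqcount_begin(); that is the hang in your subject
line.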



* Re: Process Hang in __read_seqcount_begin
  2012-10-26 21:05                 ` Eric Dumazet
@ 2012-10-26 21:25                   ` Peter LaDow
  2012-10-26 21:54                     ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Peter LaDow @ 2012-10-26 21:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, linux-rt-users, netfilter

On Fri, Oct 26, 2012 at 2:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Do you know what per-cpu data is in the Linux kernel?

I sorta did.  But since your response, I did more reading, and now I
see what you mean.  But I don't think this is a per-cpu issue.  More
below.

> Because it's not needed.  Really, I don't know why you want that.
>
> Once you are sure a thread cannot be interrupted by a softirq, and
> cannot migrate to another CPU, access to per-cpu data doesn't need
> any other synchronization at all.

Because there are multiple entry points on the same CPU.  In
net/ipv4/netfilter/ip_tables.c, there are two callers of
xt_write_recseq_begin().  The first is in ipt_do_table() and the other
is in do_add_counters().  Where we are seeing the lockup is with a
getsockopt syscall leading to get_counters().  The other path is
through ipt_do_table(), which is installed as a hook.  I'm not sure
from what context the hooks are called, but it is certainly from a
different context than the syscall.
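
(The do_add_counters() path, as I read it in my tree -- paraphrasing,
so variable names may be off -- does:

local_bh_disable();
addend = xt_write_recseq_begin();
loc_cpu_entry = private->entries[curcpu];
xt_entry_foreach(iter, loc_cpu_entry, private->size) {
	ADD_COUNTER(iter->counters, paddc[i].bcnt, paddc[i].pcnt);
	++i;
}
xt_write_recseq_end(addend);
local_bh_enable();

so it bumps the same per-cpu write sequence that ipt_do_table() does.)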

> The following sequence is safe:
>
> addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
> /*
>  * This is kind of a write_seqcount_begin(), but addend is 0 or 1.
>  * We don't check the addend value, to avoid a test and conditional jump,
>  * since addend is most likely 1.
>  */
> __this_cpu_add(xt_recseq.sequence, addend);

If this were safe, we wouldn't be seeing this lockup and your patch
wouldn't be needed.  So it seems that your patch doesn't really
address the issue that we are not "sure a thread cannot be interrupted
by a softirq, and cannot migrate to another cpu".  Well, we know it
cannot migrate to another CPU, because there isn't another CPU.  So
apparently, it can be interrupted by a softirq.  So local_bh_disable
isn't doing anything useful in the RT patches with regard to this.

As I mentioned earlier, I think perhaps what your patch did was ensure
an atomic update of the sequence counter.  But it does nothing to
synchronize two writers.  If they were already synchronized (such as
via the call to local_bh_disable), then we wouldn't see sequence
counter corruption, no?
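
(For concreteness, I'm reading your cmpxchg change as turning the
unlocked read/add pair into a retry loop, something like the sketch
below -- please correct me if I've misread the patch:

static inline unsigned int xt_write_recseq_begin(void)
{
	unsigned int old, addend;

	/* Retry until the sequence is advanced atomically; if a
	 * softirq interrupts us between the read and the update, the
	 * cmpxchg fails and we simply try again instead of losing an
	 * update. */
	do {
		old = __this_cpu_read(xt_recseq.sequence);
		addend = (old + 1) & 1;
	} while (__this_cpu_cmpxchg(xt_recseq.sequence, old, old + addend) != old);

	return addend;
}

That makes the sequence update itself safe, but as noted above it does
not serialize the table updates of two writers.)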

Pete


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 21:25                   ` Peter LaDow
@ 2012-10-26 21:54                     ` Thomas Gleixner
  2012-10-30 23:09                       ` Peter LaDow
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2012-10-26 21:54 UTC (permalink / raw)
  To: Peter LaDow; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

On Fri, 26 Oct 2012, Peter LaDow wrote:
> On Fri, Oct 26, 2012 at 2:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> If this were safe, we wouldn't be seeing this lockup and your patch
> wouldn't be needed.  So it seems that your patch doesn't really
> address the issue that we are not "sure a thread cannot be interrupted
> by a softirq, and cannot migrate to another cpu".  Well, we know it
> cannot migrate to another CPU, because there isn't another CPU.  So
> apparently, it can be interrupted by a softirq.  So local_bh_disable
> isn't doing anything useful in the RT patches with regard to this.

RT changes the semantics slightly.  And yes, it's not prepared for
code that relies on some of the magic implicit semantics of mainline.

Let me have a look at the whole scenario, once I'm more awake.

Thanks,

	tglx


* Re: Process Hang in __read_seqcount_begin
  2012-10-26 21:54                     ` Thomas Gleixner
@ 2012-10-30 23:09                       ` Peter LaDow
  2012-10-31  0:33                         ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Peter LaDow @ 2012-10-30 23:09 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

Ok.  More of an update.  We've managed to create a scenario that
exhibits the problem much earlier.  We can now cause the lockup to
occur within a few hours (rather than the 12 to 24 hours in our other
scenario).

Our setup is to have a lot of traffic constantly being processed by
the netfilter code.  After about 2 hours, any external attempt to read
the table entries (such as with getsockopt and IPT_SO_GET_ENTRIES)
triggers the lockup.  What is strange is that this does not appear
until after a couple of hours of heavy traffic.  We never trigger this
problem in the first hour, rarely in the second hour, and always after
the second hour.

Now, our original setup did not have as much traffic.  But based upon
a quick, back-of-the-napkin computation, the lockup seems to occur
after a certain amount of traffic.  I can try to get firmer numbers.
But this kind of behavior hints less at a race condition between two
writers and more at something dependent upon the amount of traffic.
Indeed, my test program only uses IPT_SO_GET_ENTRIES, which does not
trigger the second path through do_add_counters().  So I'm no longer
thinking the path through setsockopt is a cause of the problem.

So instead, it seems that the only way there could be multiple
writers (assuming that is the problem) is if there are multiple
contexts through which ipt_do_table() is called.  So far, my perusal
of the code turns up only the hooks that each of the iptables modules
registers, and it isn't clear to me from what contexts those are
called.  But it does seem that even with the patch Eric provided
(which fixes the seqcount update), there is still a potential problem.

If indeed we have multiple contexts executing ipt_do_table(), it is
possible for more than just the seqcount to be corrupted.  Indeed, it
seems that any updates to the internal structures could cause
problems.  It isn't clear to me if anything other than the counters is
modified here, so I'm not sure if there are any other issues.  But
regardless, if the counters could become corrupted, then it is
possible to break any rules that use them.
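
(A note on what corrupted counters would mean: the counter updates are
plain non-atomic adds.  In x_tables.h the macro is roughly:

#define ADD_COUNTER(c, b, p) do { (c).bcnt += (b); (c).pcnt += (p); } while (0)

so on a 32-bit machine a reader can see a torn 64-bit value unless the
seqcount retry loop catches the concurrent writer -- which, if I
understand it right, is the point of the xt_recseq machinery.)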

Anyway, based on the earlier discussion, is there any reason not to
use a lock (presuming any solution properly takes into account
possible recursion)?  I understand that mainline is protected, but
perhaps in the RT version we can use a seqlock (and prevent a
recursive lock)?
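
(A rough sketch of what I mean, ignoring for the moment the nesting
that the addend trick in xt_write_recseq_begin() handles -- the
per-cpu lock here is made up, and would need seqlock_init() at boot:

static DEFINE_PER_CPU(seqlock_t, xt_recseqlock);

/* writer side, e.g. around the counter update in ipt_do_table() */
write_seqlock(this_cpu_ptr(&xt_recseqlock));
ADD_COUNTER(e->counters, skb->len, 1);
write_sequnlock(this_cpu_ptr(&xt_recseqlock));

On RT the spinlock inside the seqlock becomes a sleeping lock, so two
writers on the same CPU would serialize instead of stepping on each
other.)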

Thanks,
Pete LaDow


* Re: Process Hang in __read_seqcount_begin
  2012-10-30 23:09                       ` Peter LaDow
@ 2012-10-31  0:33                         ` Thomas Gleixner
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2012-10-31  0:33 UTC (permalink / raw)
  To: Peter LaDow; +Cc: Eric Dumazet, linux-kernel, linux-rt-users, netfilter

On Tue, 30 Oct 2012, Peter LaDow wrote:
> Anyway, based on earlier discussion, is there any reason not to use a
> lock (presuming any solution properly takes into account possible
> recursion)?  I understand that the mainline is protected, but perhaps
> in the RT version we can use seqlock (and prevent a recursive lock)?

Have you tried 3.6.3-rt9?


