Re: [RFC] textsearch infrastructure et al v2

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Thomas Graf <tgraf@suug.ch>
To: Pablo Neira <pablo@eurodev.net>
Cc: jamal <hadi@cyberus.ca>, netdev@oss.sgi.com
Subject: Re: [RFC] textsearch infrastructure et al v2
Date: Sat, 28 May 2005 15:58:13 +0200	[thread overview]
Message-ID: <20050528135813.GS15391@postel.suug.ch> (raw)
In-Reply-To: <42986A85.9060001@eurodev.net>

* Pablo Neira <42986A85.9060001@eurodev.net> 2005-05-28 14:56
> hm, i don't understand quite well, i bet that libqsearch was already 
> fragment-aware.

I'll rephrase, libqsearch, according to my understanding, is not able
to do optimized scanning over borders, i.e. it uses the same naive
approach as your proposal. This isn't necessarly a weakness, it
heavly depends on the length of the pattern and the amount of
fragments.

One of the major differences between libqsearch and the textsearch
I propose is that libqsearch implements "jokers", a subset of
what ts_regexp.c does. This heavly complicates their reference
implementation of bm. Again, this isn't necessarly a weakness, since
the framework doesn't require the algorithm to actually implement
these.

I cannot really tell which of the proposals we have right now
is the best/fastest/cleanest/fanciest/whatever. For that reason
I decided to transfer as much responsibility as possibly away
from the core into the actual algorithms, respectively made
it configureable via callback functions. It is still possible
to write a ts_(kmp|bm)_state.c which saves the state of the
scanning progress but other than in libqsearch this is not
a requirement.

> For small pattern and long packets, such pattern searching on the 
> fragment borders doesn't really hurt the performance.

Exactly.

> boyer-moore (BM) and other variants are definitely a must to have. I'm 
> still reading some papers about string matching and practical 
> applications (p2p traffic recognition based on string matching, ids, etc 
> etc) and the most interesting practical results come always from the use 
> of BM friends.

Absolultely, I implemented KMP because it doesn't require random
access to the text and thus serves well as a reference implementation.
I'll be glad to see your BM ported.

> I'm not familiar with those problems jamal has mentioned though, could 
> they really affect the string matching infrastructure in some way? I 
> truly prefer avoiding linearizing skb's.

I think we have to differ between non-linear skbs which we should have
without linearization and the scanning through real fragments.

In order to make some progress we have to answer these questions:

 * Do we want to be able to search on non-linear data in general?
   I say: yes

 * Do we want to be able to search on non-linear data which is
   not completely available and/or present at the time the search
   starts? In other words, do we want a search to be interruptable
   to continue later on. e.g. search over multiple skbs without
   queueing them up.
   I say: no

 * Do we want to provide random access to the text to be searched?
   If so, optionally or as a requirement? This merely has impact
   on a) BM could be efficiently implemented w/o naive scanning
   around the borders and b) requires the text segments to be
   completely maped/prepare at the risk they'll never be used. [0]
   I say: probably no

 [0] This impact can be limited by having the user specify strict
     searching area limits.

next prev parent reply	other threads:[~2005-05-28 13:58 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
2005-05-27 22:47 ` [PATCH 1/5] [LIB] textsearch infrastructure Thomas Graf
2005-05-27 22:48 ` [PATCH 2/5] [LIB] Knuth-Morris-Pratt string-matching algorithm Thomas Graf
2005-05-27 22:48 ` [PATCH 3/5] [LIB] Naive regular expression " Thomas Graf
2005-05-27 22:48 ` [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs Thomas Graf
2005-05-28  3:11   ` Pablo Neira
2005-05-28 11:32     ` Thomas Graf
2005-05-27 22:49 ` [PATCH 5/5] [PKT_SCHED] textsearch ematch Thomas Graf
2005-05-28 11:59 ` [RFC] textsearch infrastructure et al v2 jamal
2005-05-28 12:35   ` Thomas Graf
2005-05-28 12:56     ` Pablo Neira
2005-05-28 12:58       ` Pablo Neira
2005-05-28 12:58       ` Pablo Neira
2005-05-28 13:58       ` Thomas Graf [this message]
2005-05-31 22:05       ` David S. Miller
2005-05-31 21:56 ` David S. Miller
2005-05-31 22:44   ` Thomas Graf
2005-05-31 22:50     ` David S. Miller

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20050528135813.GS15391@postel.suug.ch \
    --to=tgraf@suug.ch \
    --cc=hadi@cyberus.ca \
    --cc=netdev@oss.sgi.com \
    --cc=pablo@eurodev.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).