[RFC] textsearch infrastructure et al v2

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] textsearch infrastructure et al v2
@ 2005-05-27 22:47 Thomas Graf
  2005-05-27 22:47 ` [PATCH 1/5] [LIB] textsearch infrastructure Thomas Graf
                   ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-27 22:47 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim

Dave, I'm going through another RFC round to hopefully get
some comments on the notes below.

I'm not yet satistifed with the core infrastructure wrt to
non-linear data, especially for complex cases such as non
linear skbs. The current way can be easly described as:

search_occurrance()
  search-algorithm()
    while get-next-segment()
      process-segment()

So basically we save the state of the segment fetching
instead of saving the state of the search algorithm which
would be way too complex for things like regular expression
string matching.

In the case of non linear skbs this would lead to:

1) start of search algorithm
2) first call to get_text(), return skb->data
3) processing of the text segment, return if match found
4) second call to get_text(), map and return first fragment
5) processing ...
6) third call to get_text(), unmap first fragment, map and
   return second fragment
7) processing ...
n) etc.

Assuming that the most common case is to have no fragments
at all or if we only process a very limited number of
fragments (due to restricted search area) this should be
quite fast. However, the complexity of get_text() for the
skb case including the map/unmap of the fragment interrupts
the search flow quite a bit which hurts performance.

In order to address this issue I'm having second thoughts
about whether to allow the textsearch user to provide an
array of text segments including their size, i.e. the skb
fragments could be maped in prior and the algorithm could
scan through it like nothing. ;-> However, this has one
essential drawback, we might map fragments which we'll
never use so this can be slower in the end.

Any other ideas around?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/5] [LIB] textsearch infrastructure
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
@ 2005-05-27 22:47 ` Thomas Graf
  2005-05-27 22:48 ` [PATCH 2/5] [LIB] Knuth-Morris-Pratt string-matching algorithm Thomas Graf
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-27 22:47 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim

Before a textsearch can be performed, a configuration must be created
by calling textsearch_prepare() specyfing the searching algorithm and
the pattern to match. The returned configuration may then be used for
an arbitary amount of times and even in parallel as long as you
provide a separate struct ts_state variable for every instance.

The actual search is performed by either calling textsearch_find_-
continuous() for linear text or by providing an own get_text()
implementation and calling textsearch_find(). Both functions will
returns the position of the first matching occurrence of the patern.
Subsequent matches may be retrived via textsearch_next() irrespective
of the linearity of the text.

Once you're done using a configuration you must give back the
allocted resources by calling textsearch_destroy().

Signed-off-by: Thomas Graf <tgraf@suug.ch>

---
commit bf7ae763f13d767bd039703b3ab4f5954561df39
tree ab065819ea6e966aa3db4f1c5935c421dd689d2e
parent fde09d4989b6c7b5183ffb3ec2076ff53dd6fa78
author Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 20:18:48 +0200
committer Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 20:18:48 +0200

 include/linux/textsearch.h |  196 ++++++++++++++++++++++++++++++++++++
 lib/Makefile               |    2 
 lib/textsearch.c           |  241 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 438 insertions(+), 1 deletion(-)

Index: include/linux/textsearch.h
===================================================================
--- /dev/null  (tree:b032d0d440d93aac252e656bd41df32ff5461e3a)
+++ ab065819ea6e966aa3db4f1c5935c421dd689d2e/include/linux/textsearch.h  (mode:100644)
@@ -0,0 +1,196 @@
+#ifndef __LINUX_TEXTSEARCH_H
+#define __LINUX_TEXTSEARCH_H
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/err.h>
+
+struct ts_config;
+
+/**
+ * struct ts_state - textsearch state
+ * @offset: offset for next match
+ * @args: for persistant variables of get_text()
+ */
+struct ts_state
+{
+	int			offset;
+	long			args[6];
+};
+
+/**
+ * struct ts_ops - textsearch operations
+ * @name: name of search algorithm
+ * @init: called upon initialization to prepare a search
+ * @find: does the actual matching on a prepared configuration
+ * @destroy: called upon destroyage of a configuration
+ * @owner: reference to algorithm module if existent
+ * @list: operations list we are on
+ */
+struct ts_ops
+{
+	const char		*name;
+	struct ts_config *	(*init)(const unsigned char *, size_t, int);
+	int			(*find)(struct ts_config *,
+					struct ts_state *);
+	void			(*destroy)(struct ts_config *);
+	unsigned char *		(*get_pattern)(struct ts_config *);
+	unsigned int		(*get_pattern_len)(struct ts_config *);
+	struct module		*owner;
+	struct list_head	list;
+};
+
+/**
+ * struct ts_config - textsearch configuration
+ * @ops: textsearch operations
+ * @get_text: callback to fetch text to search in
+ * @len: length of pattern
+ */
+struct ts_config
+{
+	struct ts_ops		*ops;
+
+	/**
+	 * get_text - return next chunk of text
+	 * @offset: Number of bytes consumed by the matcher
+	 * @dst: destination buffer
+	 * @conf: search configuration
+	 * @state: search state
+	 *
+	 * Gets called repeatedly until 0 is returned.Must assign a pointer
+	 * to the start of the next chunk of text to *dst and return the
+	 * length of the chunk or 0 if at the end. offset==0 indicates
+	 * a new search. May store/read persistant values in state->args[].
+	 */
+	int			(*get_text)(int offset, unsigned char **dst,
+					    struct ts_config *conf,
+					    struct ts_state *state);
+
+	/**
+	 * finish - called when the matching has been completed successful
+	 *          or not.
+	 * @conf: search configuration
+	 * @state: search state
+	 */
+	void			(*finish)(struct ts_config *conf,
+					  struct ts_state *state);
+};
+
+/**
+ * textsearch_next - continue search for a pattern in a text
+ * @conf: search configuration
+ * @state: search state
+ *
+ * Continues a search looking for more occurrences of the pattern.
+ * You must call textsearch_find() to search the first occurrence
+ * in order to reset the state.
+ *
+ * Returns the position of next occurance of the pattern or a
+ *         negative number if no match was found.
+ */ 
+static inline int textsearch_next(struct ts_config *conf,
+				  struct ts_state *state)
+{
+	int ret = conf->ops->find(conf, state);
+
+	if (conf->finish)
+		conf->finish(conf, state);
+
+	return ret;
+}
+
+/**
+ * textsearch_find - search for a pattern in a text
+ * @conf: search configuration
+ * @state: search state
+ *
+ * Peforms the actual search on the prepared configuration.
+ *
+ * Returns the position of first occurrence of the pattern or a
+ *         negative number if no match was found.
+ */ 
+static inline int textsearch_find(struct ts_config *conf,
+				  struct ts_state *state)
+{
+	state->offset = 0;
+	return textsearch_next(conf, state);
+}
+
+/**
+ * textsearch_destroy - destroy a textsearch configuration
+ * @conf: search configuration
+ *
+ * Releases all references of the configuration and frees
+ * up the memory.
+ */
+static inline void textsearch_destroy(struct ts_config *conf)
+{
+	if (conf->ops) {
+		if (conf->ops->destroy)
+			conf->ops->destroy(conf);
+		module_put(conf->ops->owner);
+	}
+
+	kfree(conf);
+}
+
+/**
+ * textsearch_get_pattern - return pattern of a textsearch configuration
+ * @conf: textsearch configuration
+ */
+static inline unsigned char *textsearch_get_pattern(struct ts_config *conf)
+{
+	return conf->ops->get_pattern(conf);
+}
+
+/**
+ * textsearch_get_pattern_len - return pattern length of a textsearch configuration
+ * @conf: textsearch configuration
+ */
+static inline unsigned int textsearch_get_pattern_len(struct ts_config *conf)
+{
+	return conf->ops->get_pattern_len(conf);
+}
+
+/**
+ * alloc_ts_config - allocate a textsearch configuration
+ * @payload: size of additional module specific data required
+ * @gfp_mask: allocation mask
+ *
+ * Returns a empty textsearch configuration or a ERR_PTR().
+ */
+static inline struct ts_config *alloc_ts_config(size_t payload, int gfp_mask)
+{
+	struct ts_config *conf;
+
+	conf = kmalloc(sizeof(*conf) + payload, gfp_mask);
+	if (conf == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	memset(conf, 0, sizeof(*conf) + payload);
+	return conf;
+}
+
+/**
+ * ts_config_priv - returns private part of a textsearch configuration
+ * @conf: textsearch configuration
+ */
+static inline void *ts_config_priv(struct ts_config *conf)
+{
+	return ((unsigned char *) conf + sizeof(struct ts_config));
+}
+
+extern int textsearch_register(struct ts_ops *);
+extern int textsearch_unregister(struct ts_ops *);
+extern struct ts_config *textsearch_prepare(const char *, const unsigned char *,
+					    size_t, int, int);
+extern int textsearch_find_continuous(struct ts_config *, struct ts_state *,
+				      const unsigned char *, size_t);
+
+#endif /* __KERNEL__ */
+
+#endif
Index: lib/Makefile
===================================================================
--- b032d0d440d93aac252e656bd41df32ff5461e3a/lib/Makefile  (mode:100644)
+++ ab065819ea6e966aa3db4f1c5935c421dd689d2e/lib/Makefile  (mode:100644)
@@ -6,7 +6,7 @@
 	 bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
 	 kobject.o kref.o idr.o div64.o int_sqrt.o \
 	 bitmap.o extable.o kobject_uevent.o prio_tree.o sha1.o \
-	 halfmd4.o
+	 halfmd4.o textsearch.o
 
 obj-y += sort.o parser.o
 
Index: lib/textsearch.c
===================================================================
--- /dev/null  (tree:b032d0d440d93aac252e656bd41df32ff5461e3a)
+++ ab065819ea6e966aa3db4f1c5935c421dd689d2e/lib/textsearch.c  (mode:100644)
@@ -0,0 +1,241 @@
+/*
+ * lib/textsearch.c	Generic text search interface
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ *
+ * ==========================================================================
+ *
+ * The textsearch infrastructure provides text searching facitilies for both
+ * linear and non-linear data.
+ *
+ * Before a textsearch can be performed, a configuration must be created
+ * by calling textsearch_prepare() specyfing the searching algorithm and
+ * the pattern to match. The returned configuration may then be used for
+ * an arbitary amount of times and even in parallel as long as you
+ * provide a separate struct ts_state variable for every instance.
+ *
+ * The actual search is performed by either calling textsearch_find_-
+ * continuous() for linear text or by providing an own get_text()
+ * implementation and calling textsearch_find(). Both functions will
+ * returns the position of the first matching occurrence of the patern.
+ * Subsequent matches may be retrived via textsearch_next() irrespective
+ * of the linearity of the text.
+ *
+ * Once you're done using a configuration you must give back the
+ * allocted resources by calling textsearch_destroy().
+ *
+ * Example:
+ *   int pos;
+ *   struct ts_config *conf;
+ *   struct ts_state state;
+ *   conar char *pattern = "chicken";
+ *   const char *example = "We dance the funky chicken";
+ *
+ *   conf = textsearch_prepare("kmp", pattern, strlen(pattern), GFP_KERNEL, 1);
+ *   if (IS_ERR(conf)) {
+ *       err = PTR_ERR(conf);
+ *       goto errout;
+ *   }
+ *
+ *   pos = textsearch_find_continuous(conf, &state, example, strlen(example));
+ *   if (pos >= 0)
+ *       panic("Oh my god, dancing chickens at %d\n", pos);
+ *
+ *   textsearch_destroy(conf);
+ *
+ * ==========================================================================
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/err.h>
+#include <linux/textsearch.h>
+
+static LIST_HEAD(ts_ops);
+static DEFINE_RWLOCK(ts_mod_lock);
+
+static inline struct ts_ops *lookup_ts_algo(const char *name)
+{
+	struct ts_ops *o;
+
+	read_lock(&ts_mod_lock);
+	list_for_each_entry(o, &ts_ops, list) {
+		if (!strcmp(name, o->name)) {
+			if (!try_module_get(o->owner))
+				o = NULL;
+			read_unlock(&ts_mod_lock);
+			return o;
+		}
+	}
+	read_unlock(&ts_mod_lock);
+
+	return NULL;
+}
+
+/**
+ * textsearch_register - register a textsearch module
+ * @ops: operations lookup table
+ *
+ * This function must be called by textsearch modules to announce
+ * their presence. The given @ops must have %name set to a unique
+ * identifier and the callbacks find() and init() must be implemented.
+ *
+ * Returns -EEXISTS if another module has already registered with
+ *         same name.
+ */
+int textsearch_register(struct ts_ops *ops)
+{
+	int err = -EEXIST;
+	struct ts_ops *o;
+
+	if (ops->name == NULL || ops->find == NULL || ops->init == NULL ||
+	    ops->get_pattern == NULL || ops->get_pattern_len == NULL)
+		return -EINVAL;
+
+	write_lock(&ts_mod_lock);
+	list_for_each_entry(o, &ts_ops, list)
+		if (!strcmp(ops->name, o->name))
+			goto errout;
+
+	list_add_tail(&ops->list, &ts_ops);
+	err = 0;
+errout:
+	write_unlock(&ts_mod_lock);
+	return err;
+}
+
+/**
+ * textsearch_unregister - unregister a textsearch module
+ * @ops: operations lookup table
+ *
+ * This function must be called by textsearch modules to announce
+ * their disappearance for examples when the module gets unloaded.
+ * The @ops parameter must be the same as the one during the
+ * registration.
+ *
+ * Returns -ENOENT if no matching textsearch registration was found.
+ */
+int textsearch_unregister(struct ts_ops *ops)
+{
+	int err = 0;
+	struct ts_ops *o;
+
+	write_lock(&ts_mod_lock);
+	list_for_each_entry(o, &ts_ops, list) {
+		if (o == ops) {
+			list_del(&o->list);
+			goto out;
+		}
+	}
+
+	err = -ENOENT;
+out:
+	write_unlock(&ts_mod_lock);
+	return err;
+}
+
+static inline int get_linear_text(int offset, unsigned char **text,
+				  struct ts_config *conf,
+				  struct ts_state *state)
+{
+	unsigned char *string = (unsigned char *) state->args[0];
+	size_t len = state->args[1];
+
+	if (offset < len) {
+		*text = string + offset;
+		return len - offset;
+	} else
+		return 0;
+}
+
+/**
+ * textsearch_find_continuous - search a pattern in continuous/linear text
+ * @conf: search configuration
+ * @state: search state
+ * @text: text to search in
+ * @len: length of text
+ *
+ * A simplified version of textsearch_find() for continuous/linear text.
+ * Call textsearch_next() to retrieve subsequent matches.
+ *
+ * Returns the position of first occurrence of the pattern or a
+ *         negative number if no match was found.
+ */ 
+int textsearch_find_continuous(struct ts_config *conf, struct ts_state *state,
+			       const unsigned char *text, size_t len)
+{
+	conf->get_text = get_linear_text;
+	state->args[0] = (long) text;
+	state->args[1] = len;
+
+	return textsearch_find(conf, state);
+}
+
+/**
+ * textsearch_prepare - Prepare a text search
+ * @algo: name of search algorithm
+ * @pattern: pattern data
+ * @len: length of pattern
+ * @gfp_mask: allocation mask
+ * @autoload: bool indicating whether to autoload modules
+ *
+ * Looks up the search algorithm module and creates a new textsearch
+ * configuration for the specified pattern. Upon completion all
+ * necessary refcnts are held and the configuration must be put back
+ * using textsearch_put() after usage.
+ *
+ * Note: The format of the pattern may not be compatible between
+ *       the various search algorithms.
+ *
+ * Returns a new textsearch configuration according to the specified
+ *         parameters or a ERR_PTR().
+ */
+struct ts_config *textsearch_prepare(const char *algo,
+				     const unsigned char *pattern, size_t len,
+				     int gfp_mask, int autoload)
+{
+	int err = -ENOENT;
+	struct ts_config *conf;
+	struct ts_ops *ops;
+	
+	ops = lookup_ts_algo(algo);
+#ifdef CONFIG_KMOD
+	/* Why not always autoload you may ask. Some users may be
+	 * in a situation where requesting a module may deadlock,
+	 * especially when the module is located on a NFS mount. */
+	if (ops == NULL && autoload) {
+		request_module("ts_%s", algo);
+		ops = lookup_ts_algo(algo);
+	}
+#endif
+
+	if (ops == NULL)
+		goto errout;
+
+	conf = ops->init(pattern, len, gfp_mask);
+	if (IS_ERR(conf)) {
+		err = PTR_ERR(conf);
+		goto errout;
+	}
+
+	conf->ops = ops;
+	return conf;
+
+errout:
+	if (ops)
+		module_put(ops->owner);
+		
+	return ERR_PTR(err);
+}
+
+EXPORT_SYMBOL(textsearch_register);
+EXPORT_SYMBOL(textsearch_unregister);
+EXPORT_SYMBOL(textsearch_prepare);
+EXPORT_SYMBOL(textsearch_find_continuous);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 2/5] [LIB] Knuth-Morris-Pratt string-matching algorithm
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
  2005-05-27 22:47 ` [PATCH 1/5] [LIB] textsearch infrastructure Thomas Graf
@ 2005-05-27 22:48 ` Thomas Graf
  2005-05-27 22:48 ` [PATCH 3/5] [LIB] Naive regular expression " Thomas Graf
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-27 22:48 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim


Signed-off-by: Thomas Graf <tgraf@suug.ch>

---
commit 5b70ca8eab4c7d7ef884582d9713cdbffa0f4cd4
tree 4d90ca82120da7b308b9a6bf11a1069473ca5d30
parent bf7ae763f13d767bd039703b3ab4f5954561df39
author Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:44:02 +0200
committer Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:44:02 +0200

 lib/Kconfig  |   13 +++++
 lib/Makefile |    2 
 lib/ts_kmp.c |  145 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 160 insertions(+)

Index: lib/Kconfig
===================================================================
--- ab065819ea6e966aa3db4f1c5935c421dd689d2e/lib/Kconfig  (mode:100644)
+++ 4d90ca82120da7b308b9a6bf11a1069473ca5d30/lib/Kconfig  (mode:100644)
@@ -57,5 +57,18 @@
 config REED_SOLOMON_DEC16
 	boolean
 
+menu "Textsearch facility"
+
+config TEXTSEARCH_KMP
+	tristate "Knuth-Morris-Pratt"
+	help
+	  Say Y here if you want to be able to search text using the
+	  Knuth-Morris-Pratt textsearch algorithm.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called ts_kmp.
+
+endmenu
+
 endmenu
 
Index: lib/Makefile
===================================================================
--- ab065819ea6e966aa3db4f1c5935c421dd689d2e/lib/Makefile  (mode:100644)
+++ 4d90ca82120da7b308b9a6bf11a1069473ca5d30/lib/Makefile  (mode:100644)
@@ -33,6 +33,8 @@
 obj-$(CONFIG_ZLIB_DEFLATE) += zlib_deflate/
 obj-$(CONFIG_REED_SOLOMON) += reed_solomon/
 
+obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
+
 hostprogs-y	:= gen_crc32table
 clean-files	:= crc32table.h
 
Index: lib/ts_kmp.c
===================================================================
--- /dev/null  (tree:ab065819ea6e966aa3db4f1c5935c421dd689d2e)
+++ 4d90ca82120da7b308b9a6bf11a1069473ca5d30/lib/ts_kmp.c  (mode:100644)
@@ -0,0 +1,145 @@
+/*
+ * lib/ts_kmp.c		Knuth-Morris-Pratt text search implementation
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ * 
+ * Implements a linear-time string-matching algorithm due to Knuth,
+ * Morris, and Pratt [0]. Their algorithm avoids the explicit
+ * computation of the transition function DELTA altogether. Its
+ * matching time is O(n), for n being length(text), using just an
+ * auxiliary function PI[1..m], for m being length(pattern),
+ * precomputed from the pattern in time O(m). The array PI allows
+ * the transition function DELTA to be computed efficiently
+ * "on the fly" as needed. Roughly speaking, for any state
+ * "q" = 0,1,...,m and any character "a" in SIGMA, the value
+ * PI["q"] contains the information that is independent of "a" and
+ * is needed to compute DELTA("q", "a") [1]. Since the array PI
+ * has only m entries, whereas DELTA has O(m|SIGMA|) entries, we
+ * save a factor of |SIGMA| in the preprocessing time by computing
+ * PI rather than DELTA.
+ *
+ * [0] Cormen, Leiserson, Rivest, Stein
+ *     Introdcution to Algorithms, 2nd Edition, MIT Press
+ * [1] See finite automation theory
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/textsearch.h>
+
+struct ts_kmp
+{
+	int		pattern_len;
+	unsigned char *	pattern;
+	unsigned int 	prefix_tbl[0];
+};
+
+static int kmp_find(struct ts_config *conf, struct ts_state *state)
+{
+	struct ts_kmp *kmp = ts_config_priv(conf);
+	int i, q = 0, consumed = state->offset;
+	unsigned char *text;
+	size_t text_len;
+
+	for (;;) {
+		text_len = conf->get_text(consumed, &text, conf, state);
+
+		if (text_len == 0)
+			break;
+
+		for (i = 0; i < text_len; i++) {
+			while (q > 0 && kmp->pattern[q] != text[i])
+				q = kmp->prefix_tbl[q - 1];
+			if (kmp->pattern[q] == text[i])
+				q++;
+			if (q == kmp->pattern_len) {
+				state->offset = consumed + i + 1;
+				return state->offset - kmp->pattern_len;
+			}
+		}
+
+		consumed += text_len;
+	}
+
+	return -1;
+
+}
+
+static inline void compute_prefix_tbl(const unsigned char *pattern, size_t len,
+				      unsigned int *prefix_tbl)
+{
+	unsigned int k, q;
+
+	for (k = 0, q = 1; q < len; q++) {
+		while (k > 0 && pattern[k] != pattern[q])
+			k = prefix_tbl[k-1];
+		if (pattern[k] == pattern[q])
+			k++;
+		prefix_tbl[q] = k;
+	}
+}
+
+static struct ts_config *kmp_init(const unsigned char *pattern, size_t len,
+				  int gfp_mask)
+{
+	struct ts_config *conf;
+	struct ts_kmp *kmp;
+	size_t prefix_tbl_len = len * sizeof(unsigned int);
+	size_t priv_size = sizeof(*kmp) + len + prefix_tbl_len;
+
+	conf = alloc_ts_config(priv_size, gfp_mask);
+	if (IS_ERR(conf))
+		return conf;
+
+	kmp = ts_config_priv(conf);
+	kmp->pattern_len = len;
+	compute_prefix_tbl(pattern, len, kmp->prefix_tbl);
+	kmp->pattern = (unsigned char *) kmp->prefix_tbl + prefix_tbl_len;
+	memcpy(kmp->pattern, pattern, len);
+
+	return conf;
+}
+
+static unsigned char *kmp_get_pattern(struct ts_config *conf)
+{
+	struct ts_kmp *kmp = ts_config_priv(conf);
+	return kmp->pattern;
+}
+
+static unsigned int kmp_get_pattern_len(struct ts_config *conf)
+{
+	struct ts_kmp *kmp = ts_config_priv(conf);
+	return kmp->pattern_len;
+}
+
+static struct ts_ops kmp_ops = {
+	.name		  = "kmp",
+	.find		  = kmp_find,
+	.init		  = kmp_init,
+	.get_pattern	  = kmp_get_pattern,
+	.get_pattern_len  = kmp_get_pattern_len,
+	.owner		  = THIS_MODULE,
+	.list		  = LIST_HEAD_INIT(kmp_ops.list)
+};
+
+static int __init init_kmp(void)
+{
+	return textsearch_register(&kmp_ops);
+}
+
+static void __exit exit_kmp(void)
+{
+	textsearch_unregister(&kmp_ops);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(init_kmp);
+module_exit(exit_kmp);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 3/5] [LIB] Naive regular expression string-matching algorithm
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
  2005-05-27 22:47 ` [PATCH 1/5] [LIB] textsearch infrastructure Thomas Graf
  2005-05-27 22:48 ` [PATCH 2/5] [LIB] Knuth-Morris-Pratt string-matching algorithm Thomas Graf
@ 2005-05-27 22:48 ` Thomas Graf
  2005-05-27 22:48 ` [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs Thomas Graf
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-27 22:48 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim


Signed-off-by: Thomas Graf <tgraf@suug.ch>

---
commit 81553cdea2a7ff762cb44aeced743d3a77efd2d7
tree 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6
parent 5b70ca8eab4c7d7ef884582d9713cdbffa0f4cd4
author Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:44:24 +0200
committer Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:44:24 +0200

 include/linux/textsearch_regexp.h |   47 ++++++
 lib/Kconfig                       |    9 +
 lib/Makefile                      |    1 
 lib/ts_regexp.c                   |  272 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 329 insertions(+)

Index: include/linux/textsearch_regexp.h
===================================================================
--- /dev/null  (tree:4d90ca82120da7b308b9a6bf11a1069473ca5d30)
+++ 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6/include/linux/textsearch_regexp.h  (mode:100644)
@@ -0,0 +1,47 @@
+#ifndef __LINUX_TEXTSEARCH_REGEXP_H
+#define __LINUX_TEXTSEARCH_REGEXP_H
+
+#include <linux/types.h>
+
+enum {
+	TS_RE_SPECIFIC,		/* specific character */
+	TS_RE_WILDCARD,		/* any character */
+	TS_RE_DIGIT,		/* isdigit() */
+	TS_RE_XDIGIT,		/* isxdigit() */
+	TS_RE_PRINT,		/* isprint() */
+	TS_RE_ALPHA,		/* isalpha() */
+	TS_RE_ALNUM,		/* isalnum() */
+	TS_RE_ASCII,		/* isascii() */
+	TS_RE_CNTRL,		/* iscntrl() */
+	TS_RE_GRAPH,		/* isgraph() */
+	TS_RE_LOWER,		/* islower() */
+	TS_RE_UPPER,		/* isupper() */
+	TS_RE_PUNCT,		/* ispunct() */
+	TS_RE_SPACE,		/* isspace() */
+	__TS_RE_TYPE_MAX,
+};
+#define TS_RE_TYPE_MAX (__TS_RE_TYPE_MAX - 1)
+
+enum {
+	TS_RE_SINGLE,		/* 1 occurrence */
+	TS_RE_PERHAPS,		/* 1 or 0 occurrence */
+	TS_RE_ANY,		/* 0..n occurrences */
+	TS_RE_MULTI,		/* 1..n occurrences */
+	__TS_RE_RECUR_MAX,
+};
+#define TS_RE_RECUR_MAX (__TS_RE_RECUR_MAX - 1)
+
+/**
+ * struct ts_regexp_token - regular expression token
+ * @type: type of token
+ * @recur: number of recurrences
+ * @value: character value for TS_RE_SPECIFIC
+ */
+struct ts_regexp_token
+{
+	__u16		type;
+	__u8		recur;
+	__u8		value;
+};
+
+#endif
Index: lib/Kconfig
===================================================================
--- 4d90ca82120da7b308b9a6bf11a1069473ca5d30/lib/Kconfig  (mode:100644)
+++ 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6/lib/Kconfig  (mode:100644)
@@ -68,6 +68,15 @@
 	  To compile this code as a module, choose M here: the
 	  module will be called ts_kmp.
 
+config TEXTSEARCH_REGEXP
+	tristate "Regular Expression"
+	help
+	  Say Y here if you want to be able to search text using a
+	  naive and limited regular expression textsearch algorithm.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called ts_regexp.
+
 endmenu
 
 endmenu
Index: lib/Makefile
===================================================================
--- 4d90ca82120da7b308b9a6bf11a1069473ca5d30/lib/Makefile  (mode:100644)
+++ 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6/lib/Makefile  (mode:100644)
@@ -34,6 +34,7 @@
 obj-$(CONFIG_REED_SOLOMON) += reed_solomon/
 
 obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o
+obj-$(CONFIG_TEXTSEARCH_REGEXP) += ts_regexp.o
 
 hostprogs-y	:= gen_crc32table
 clean-files	:= crc32table.h
Index: lib/ts_regexp.c
===================================================================
--- /dev/null  (tree:4d90ca82120da7b308b9a6bf11a1069473ca5d30)
+++ 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6/lib/ts_regexp.c  (mode:100644)
@@ -0,0 +1,272 @@
+/*
+ * lib/ts_regexp.c	Naive and very limited regular expression
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/textsearch.h>
+#include <linux/textsearch_regexp.h>
+
+struct ts_regexp
+{
+	int			ntokens;
+	struct ts_regexp_token	tokens[0];
+};
+
+/* other values derived from ctype.h */
+#define _A		0x100 /* ascii */
+#define _W		0x200 /* wildcard */
+
+/* Map to _ctype flags and some magic numbers */
+static u16 token_map[TS_RE_TYPE_MAX+1] = {
+	[TS_RE_SPECIFIC]  = 0,
+	[TS_RE_WILDCARD]  = _W,
+	[TS_RE_CNTRL]	  = _C,
+	[TS_RE_LOWER]	  = _L,
+	[TS_RE_UPPER]	  = _U,
+	[TS_RE_PUNCT]	  = _P,
+	[TS_RE_SPACE]	  = _S,
+	[TS_RE_DIGIT]	  = _D,
+	[TS_RE_XDIGIT]	  = _D | _X,
+	[TS_RE_ALPHA]	  = _U | _L,
+	[TS_RE_ALNUM]	  = _U | _L | _D,
+	[TS_RE_PRINT]	  = _P | _U | _L | _D | _SP,
+	[TS_RE_GRAPH]	  = _P | _U | _L | _D,
+	[TS_RE_ASCII]	  = _A,
+};
+
+static u16 token_lookup_tbl[256] = {
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*   0-  3 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*   4-  7 */
+_W|_A|_C,      _W|_A|_C|_S,  _W|_A|_C|_S,  _W|_A|_C|_S,		/*   8- 11 */
+_W|_A|_C|_S,   _W|_A|_C|_S,  _W|_A|_C,     _W|_A|_C,		/*  12- 15 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  16- 19 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  20- 23 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  24- 27 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  28- 31 */
+_W|_A|_S|_SP,  _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  32- 35 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  36- 39 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  40- 43 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  44- 47 */
+_W|_A|_D,      _W|_A|_D,     _W|_A|_D,     _W|_A|_D,		/*  48- 51 */
+_W|_A|_D,      _W|_A|_D,     _W|_A|_D,     _W|_A|_D,		/*  52- 55 */
+_W|_A|_D,      _W|_A|_D,     _W|_A|_P,     _W|_A|_P,		/*  56- 59 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  60- 63 */
+_W|_A|_P,      _W|_A|_U|_X,  _W|_A|_U|_X,  _W|_A|_U|_X,		/*  64- 67 */
+_W|_A|_U|_X,   _W|_A|_U|_X,  _W|_A|_U|_X,  _W|_A|_U,		/*  68- 71 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  72- 75 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  76- 79 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  80- 83 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  84- 87 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_P,		/*  88- 91 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  92- 95 */
+_W|_A|_P,      _W|_A|_L|_X,  _W|_A|_L|_X,  _W|_A|_L|_X,		/*  96- 99 */
+_W|_A|_L|_X,   _W|_A|_L|_X,  _W|_A|_L|_X,  _W|_A|_L,		/* 100-103 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 104-107 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 108-111 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 112-115 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 116-119 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_P,		/* 120-123 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_C,		/* 124-127 */
+_W,            _W,           _W,           _W,			/* 128-131 */
+_W,            _W,           _W,           _W,			/* 132-135 */
+_W,            _W,           _W,           _W,			/* 136-139 */
+_W,            _W,           _W,           _W,			/* 140-143 */
+_W,            _W,           _W,           _W,			/* 144-147 */
+_W,            _W,           _W,           _W,			/* 148-151 */
+_W,            _W,           _W,           _W,			/* 152-155 */
+_W,            _W,           _W,           _W,			/* 156-159 */
+_W|_S|_SP,     _W|_P,        _W|_P,        _W|_P,		/* 160-163 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 164-167 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 168-171 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 172-175 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 176-179 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 180-183 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 184-187 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 188-191 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 192-195 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 196-199 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 200-203 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 204-207 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 208-211 */
+_W|_U,         _W|_U,        _W|_U,        _W|_P,		/* 212-215 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 216-219 */
+_W|_U,         _W|_U,        _W|_U,        _W|_L,		/* 220-223 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 224-227 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 228-231 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 232-235 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 236-239 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 240-243 */
+_W|_L,         _W|_L,        _W|_L,        _W|_P,		/* 244-247 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 248-251 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L};		/* 252-255 */
+
+static inline int match_token(struct ts_regexp_token *t, u8 d)
+{
+	if (t->type)
+		return (token_lookup_tbl[d] & t->type) != 0;
+	else
+		return t->value == d;
+}
+
+static int regexp_find(struct ts_config *conf, struct ts_state *state)
+{
+	struct ts_regexp *regexp = ts_config_priv(conf);
+	int i, q, consumed = state->offset;
+	int match_start = state->offset;
+	struct ts_regexp_token *t = NULL, *next;
+	size_t text_len = 0;
+	unsigned char *text;
+
+#define GET_TEXT() \
+	({ i = 0; conf->get_text(consumed, &text, conf, state); })
+
+#define end_of_text()  (i >= text_len && !GET_TEXT())
+#define more_text() (i < text_len || GET_TEXT())
+
+#define NEXT_CHAR() do { i++; consumed++; } while (0)
+	
+	for (i = 0, q = 0; q < regexp->ntokens; q++) {
+		t = &regexp->tokens[q];
+
+		if (likely(q < (regexp->ntokens - 1)))
+			next = &regexp->tokens[q+1];
+		else
+			next = NULL;
+
+		switch (t->recur) {
+			case TS_RE_SINGLE:
+				if (unlikely(end_of_text()))
+					goto no_match;
+
+				if (!match_token(t, text[i]))
+					goto no_match;
+				break;
+
+			case TS_RE_PERHAPS:
+				if (likely(more_text()))
+					if (match_token(t, text[i]))
+						break;
+				continue;
+
+			case TS_RE_MULTI:
+				if (unlikely(end_of_text()))
+					goto no_match;
+
+				if (!match_token(t, text[i]))
+					goto no_match;
+
+				NEXT_CHAR();
+				/* fall through */
+
+			case TS_RE_ANY:
+				if (next == NULL)
+					goto found_match;
+
+				if (likely(more_text())) {
+					while (!match_token(next, text[i])) {
+						if (!match_token(t, text[i]))
+							goto no_match;
+						NEXT_CHAR();
+						if (unlikely(end_of_text()))
+							goto no_match;
+					}
+				}
+				continue;
+		}
+
+		NEXT_CHAR();
+	}
+
+	if (q >= (regexp->ntokens - 1))
+		goto found_match;
+
+no_match:
+	return -1;
+
+found_match:
+	state->offset = consumed + i + 1;
+	return match_start;
+}
+
+static struct ts_config *regexp_init(const unsigned char *pattern, size_t len,
+				     int gfp_mask)
+{
+	int i, err = -EINVAL;
+	struct ts_config *conf;
+	struct ts_regexp *regexp;
+	struct ts_regexp_token *tokens = (struct ts_regexp_token *) pattern;
+
+	if (len  % sizeof(struct ts_regexp_token))
+		goto errout;
+
+	for (i = 0; i < len / sizeof(struct ts_regexp_token); i++) {
+		struct ts_regexp_token *t = &tokens[i];
+
+		if (t->type > TS_RE_TYPE_MAX ||
+		    t->type > TS_RE_RECUR_MAX)
+			goto errout;
+
+		t->type = token_map[t->type];
+	}
+
+	conf = alloc_ts_config(len, gfp_mask);
+	if (IS_ERR(conf))
+		return conf;
+
+	regexp = ts_config_priv(conf);
+	regexp->ntokens = len / sizeof(struct ts_regexp_token);
+	memcpy(regexp->tokens, pattern, len);
+
+	return conf;
+
+errout:
+	return ERR_PTR(err);
+}
+
+static unsigned char *regexp_get_pattern(struct ts_config *conf)
+{
+	struct ts_regexp *regexp = ts_config_priv(conf);
+	return (unsigned char *) regexp->tokens;
+}
+
+static unsigned int regexp_get_pattern_len(struct ts_config *conf)
+{
+	struct ts_regexp *regexp = ts_config_priv(conf);
+	return regexp->ntokens * sizeof(struct ts_regexp_token);
+}
+
+static struct ts_ops regexp_ops = {
+	.name		  = "regexp",
+	.find		  = regexp_find,
+	.init		  = regexp_init,
+	.get_pattern	  = regexp_get_pattern,
+	.get_pattern_len  = regexp_get_pattern_len,
+	.owner		  = THIS_MODULE,
+	.list		  = LIST_HEAD_INIT(regexp_ops.list)
+};
+
+static int __init init_regexp(void)
+{
+	return textsearch_register(&regexp_ops);
+}
+
+static void __exit exit_regexp(void)
+{
+	textsearch_unregister(&regexp_ops);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(init_regexp);
+module_exit(exit_regexp);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
                   ` (2 preceding siblings ...)
  2005-05-27 22:48 ` [PATCH 3/5] [LIB] Naive regular expression " Thomas Graf
@ 2005-05-27 22:48 ` Thomas Graf
  2005-05-28  3:11   ` Pablo Neira
  2005-05-27 22:49 ` [PATCH 5/5] [PKT_SCHED] textsearch ematch Thomas Graf
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Thomas Graf @ 2005-05-27 22:48 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim


Adds the function skb_find_text() to search for text patterns
in both, linear and non-linear skbs.

Signed-off-by: Thomas Graf <tgraf@suug.ch>

---
commit 4fdcfba0bf99efed709bc8ebaeda0d5ed7e9cc67
tree 13ce1cd98014267f2a353b96e813c1e70bbe9c3b
parent 81553cdea2a7ff762cb44aeced743d3a77efd2d7
author Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:44:50 +0200
committer Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:44:50 +0200

 include/linux/skbuff.h |    4 ++
 net/core/skbuff.c      |   83 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

Index: include/linux/skbuff.h
===================================================================
--- 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6/include/linux/skbuff.h  (mode:100644)
+++ 13ce1cd98014267f2a353b96e813c1e70bbe9c3b/include/linux/skbuff.h  (mode:100644)
@@ -28,6 +28,7 @@
 #include <linux/poll.h>
 #include <linux/net.h>
 #include <net/checksum.h>
+#include <linux/textsearch.h>
 
 #define HAVE_ALLOC_SKB		/* For the drivers to know */
 #define HAVE_ALIGNABLE_SKB	/* Ditto 8)		   */
@@ -1187,6 +1188,9 @@
 extern void	       skb_split(struct sk_buff *skb,
 				 struct sk_buff *skb1, const u32 len);
 
+extern int skb_find_text(struct sk_buff *skb, int from, int to,
+			 struct ts_config *config, struct ts_state *state);
+
 static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
 				       int len, void *buffer)
 {
Index: net/core/skbuff.c
===================================================================
--- 266c4ad512e5ae71bd7cd88c94bfb18fb2f761d6/net/core/skbuff.c  (mode:100644)
+++ 13ce1cd98014267f2a353b96e813c1e70bbe9c3b/net/core/skbuff.c  (mode:100644)
@@ -1506,6 +1506,88 @@
 		skb_split_no_header(skb, skb1, len, pos);
 }
 
+static int get_skb_text(int offset, unsigned char **text,
+			struct ts_config *conf, struct ts_state *state)
+{
+	/* args[0]: lower limit
+	 * args[1]: upper limit
+	 * args[2]: skb
+	 * args[3]: fragment index
+	 * args[4]: current fragment data buffer
+	 * args[5]: octets consumed up to previous fragment */
+	int from = state->args[0], to = state->args[1];
+	struct sk_buff *skb = (struct sk_buff *) state->args[2];
+	int limit = min_t(int, skb_headlen(skb), to);
+	int real_offset = offset + from;
+	skb_frag_t *f;
+
+	if (!skb_is_nonlinear(skb)) {
+		if (real_offset < limit) {
+linear:
+			*text = skb->data + real_offset;
+			return limit - real_offset;
+		}
+
+		return 0;
+	}
+
+	if (real_offset < limit)
+		goto linear;
+
+next_fragment:
+	f = &skb_shinfo(skb)->frags[state->args[3]];
+	limit = min_t(int, f->size + state->args[5], to);
+
+	if (!state->args[4])
+		state->args[4] = (long) kmap_skb_frag(f);
+
+	if (real_offset < limit) {
+		*text = (unsigned char *) state->args[4] + f->page_offset +
+			(real_offset - (int) state->args[5]);
+		return limit - real_offset;
+	}
+
+	kunmap_skb_frag((void *) state->args[4]);
+	state->args[3]++;
+	state->args[4] = (long) NULL;
+	state->args[5] += f->size;
+
+	if (state->args[3] >= skb_shinfo(skb)->nr_frags)
+		return 0;
+
+	goto next_fragment;
+}
+
+static void get_skb_text_finish(struct ts_config *conf, struct ts_state *state)
+{
+	if (state->args[4])
+		kunmap_skb_frag((void *) state->args[4]);
+}
+
+/**
+ * skb_find_text - Find a text pattern in skb data
+ * @skb: the buffer to look in
+ * @from: search offset
+ * @to: search limit
+ * @config: textsearch configuration
+ * @state: empty textsearch state variable
+ */
+int skb_find_text(struct sk_buff *skb, int from, int to,
+		  struct ts_config *config, struct ts_state *state)
+{
+	config->get_text = get_skb_text;
+	config->finish = get_skb_text_finish;
+	state->args[0] = from;
+	state->args[1] = to;
+	state->args[2] = (long) skb;
+	state->args[3] = 0;
+	state->args[4] = 0;
+	state->args[5] = skb_headlen(skb);
+
+	return textsearch_find(config, state);
+}
+
+
 void __init skb_init(void)
 {
 	skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
@@ -1544,3 +1626,4 @@
 EXPORT_SYMBOL(skb_unlink);
 EXPORT_SYMBOL(skb_append);
 EXPORT_SYMBOL(skb_split);
+EXPORT_SYMBOL(skb_find_text);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs
  2005-05-27 22:48 ` [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs Thomas Graf
@ 2005-05-28  3:11   ` Pablo Neira
  2005-05-28 11:32     ` Thomas Graf
  0 siblings, 1 reply; 18+ messages in thread
From: Pablo Neira @ 2005-05-28  3:11 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, Jamal Hadi Salim

Thomas Graf wrote:
> +static int get_skb_text(int offset, unsigned char **text,
> +			struct ts_config *conf, struct ts_state *state)
> +{
> +	/* args[0]: lower limit
> +	 * args[1]: upper limit
> +	 * args[2]: skb
> +	 * args[3]: fragment index
> +	 * args[4]: current fragment data buffer
> +	 * args[5]: octets consumed up to previous fragment */
> +	int from = state->args[0], to = state->args[1];
> +	struct sk_buff *skb = (struct sk_buff *) state->args[2];
> +	int limit = min_t(int, skb_headlen(skb), to);
> +	int real_offset = offset + from;
> +	skb_frag_t *f;
> +
> +	if (!skb_is_nonlinear(skb)) {
> +		if (real_offset < limit) {
> +linear:
> +			*text = skb->data + real_offset;
> +			return limit - real_offset;
> +		}
> +
> +		return 0;
> +	}
> +
> +	if (real_offset < limit)
> +		goto linear;
> +
> +next_fragment:
> +	f = &skb_shinfo(skb)->frags[state->args[3]];
> +	limit = min_t(int, f->size + state->args[5], to);
> +
> +	if (!state->args[4])
> +		state->args[4] = (long) kmap_skb_frag(f);
> +
> +	if (real_offset < limit) {
> +		*text = (unsigned char *) state->args[4] + f->page_offset +
> +			(real_offset - (int) state->args[5]);
> +		return limit - real_offset;
> +	}
> +
> +	kunmap_skb_frag((void *) state->args[4]);
> +	state->args[3]++;
> +	state->args[4] = (long) NULL;
> +	state->args[5] += f->size;
> +
> +	if (state->args[3] >= skb_shinfo(skb)->nr_frags)
> +		return 0;
> +
> +	goto next_fragment;
> +}

Still miss something here. You aren't looking for matches in skbs 
inserted in skb_shinfo(skb)->frag_list. Have a look at rusty's skb_iter 
functions. Since he cooked those for this purpose (string matching), why 
don't we use them instead of yours ?

--
Pablo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs
  2005-05-28  3:11   ` Pablo Neira
@ 2005-05-28 11:32     ` Thomas Graf
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-28 11:32 UTC (permalink / raw)
  To: Pablo Neira; +Cc: netdev, Jamal Hadi Salim

* Pablo Neira <4297E172.5020907@eurodev.net> 2005-05-28 05:11
> Still miss something here. You aren't looking for matches in skbs 
> inserted in skb_shinfo(skb)->frag_list. Have a look at rusty's skb_iter 
> functions. Since he cooked those for this purpose (string matching), why 
> don't we use them instead of yours ?

Absolutely, I'm planning to use rusty's skb_iter instead. I postponed
this to when the core infrastructure is stable but thanks a lot for
this comment.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 5/5] [PKT_SCHED] textsearch ematch
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
                   ` (3 preceding siblings ...)
  2005-05-27 22:48 ` [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs Thomas Graf
@ 2005-05-27 22:49 ` Thomas Graf
  2005-05-28 11:59 ` [RFC] textsearch infrastructure et al v2 jamal
  2005-05-31 21:56 ` David S. Miller
  6 siblings, 0 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-27 22:49 UTC (permalink / raw)
  To: netdev; +Cc: Jamal Hadi Salim


Signed-off-by: Thomas Graf <tgraf@suug.ch>

---
commit 5d28d54b48451e7833295be68b7e4b6fcc2b5375
tree cc252d153b4b21f347aab71bbfe04c0bf2a3a8a2
parent 4fdcfba0bf99efed709bc8ebaeda0d5ed7e9cc67
author Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:45:15 +0200
committer Thomas Graf <tgraf@suug.ch> Fri, 27 May 2005 23:45:15 +0200

 include/linux/pkt_cls.h              |    1 
 include/linux/tc_ematch/tc_em_text.h |   19 ++++
 net/sched/Kconfig                    |   11 ++
 net/sched/Makefile                   |    1 
 net/sched/em_text.c                  |  142 +++++++++++++++++++++++++++++++++++
 5 files changed, 174 insertions(+)

Index: include/linux/pkt_cls.h
===================================================================
--- 13ce1cd98014267f2a353b96e813c1e70bbe9c3b/include/linux/pkt_cls.h  (mode:100644)
+++ cc252d153b4b21f347aab71bbfe04c0bf2a3a8a2/include/linux/pkt_cls.h  (mode:100644)
@@ -408,6 +408,7 @@
 	TCF_EM_NBYTE,
 	TCF_EM_U32,
 	TCF_EM_META,
+	TCF_EM_TEXT,
 	__TCF_EM_MAX
 };
 
Index: include/linux/tc_ematch/tc_em_text.h
===================================================================
--- /dev/null  (tree:13ce1cd98014267f2a353b96e813c1e70bbe9c3b)
+++ cc252d153b4b21f347aab71bbfe04c0bf2a3a8a2/include/linux/tc_ematch/tc_em_text.h  (mode:100644)
@@ -0,0 +1,19 @@
+#ifndef __LINUX_TC_EM_TEXT_H
+#define __LINUX_TC_EM_TEXT_H
+
+#include <linux/pkt_cls.h>
+
+#define TC_EM_TEXT_ALGOSIZ	16
+
+struct tcf_em_text
+{
+	char		algo[TC_EM_TEXT_ALGOSIZ];
+	__u16		from_offset;
+	__u16		to_offset;
+	__u16		pattern_len;
+	__u8		from_layer:4;
+	__u8		to_layer:4;
+	__u8		pad;
+};
+
+#endif
Index: net/sched/Kconfig
===================================================================
--- 13ce1cd98014267f2a353b96e813c1e70bbe9c3b/net/sched/Kconfig  (mode:100644)
+++ cc252d153b4b21f347aab71bbfe04c0bf2a3a8a2/net/sched/Kconfig  (mode:100644)
@@ -449,6 +449,17 @@
 	  To compile this code as a module, choose M here: the
 	  module will be called em_meta.
 
+config NET_EMATCH_TEXT
+	tristate "Textsearch"
+	depends on NET_EMATCH
+	---help---
+	  Say Y here if you want to be ablt to classify packets based on
+	  textsearch comparisons. Please select the appropriate textsearch
+	  algorithms in the Library section.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called em_text.
+
 config NET_CLS_ACT
 	bool "Packet ACTION"
 	depends on EXPERIMENTAL && NET_CLS && NET_QOS
Index: net/sched/Makefile
===================================================================
--- 13ce1cd98014267f2a353b96e813c1e70bbe9c3b/net/sched/Makefile  (mode:100644)
+++ cc252d153b4b21f347aab71bbfe04c0bf2a3a8a2/net/sched/Makefile  (mode:100644)
@@ -40,3 +40,4 @@
 obj-$(CONFIG_NET_EMATCH_NBYTE)	+= em_nbyte.o
 obj-$(CONFIG_NET_EMATCH_U32)	+= em_u32.o
 obj-$(CONFIG_NET_EMATCH_META)	+= em_meta.o
+obj-$(CONFIG_NET_EMATCH_TEXT)	+= em_text.o
Index: net/sched/em_text.c
===================================================================
--- /dev/null  (tree:13ce1cd98014267f2a353b96e813c1e70bbe9c3b)
+++ cc252d153b4b21f347aab71bbfe04c0bf2a3a8a2/net/sched/em_text.c  (mode:100644)
@@ -0,0 +1,142 @@
+/*
+ * net/sched/em_text.c	Textsearch ematch
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/string.h>
+#include <linux/skbuff.h>
+#include <linux/textsearch.h>
+#include <linux/tc_ematch/tc_em_text.h>
+#include <net/pkt_cls.h>
+
+struct text_match
+{
+	u16			from_offset;
+	u16			to_offset;
+	u8			from_layer;
+	u8			to_layer;
+	struct ts_config	*config;
+};
+
+static int em_text_match(struct sk_buff *skb, struct tcf_ematch *m,
+			 struct tcf_pkt_info *info)
+{
+	struct text_match *tm = (struct text_match *) m->data;
+	int from, to;
+	struct ts_state state;
+
+	from = tcf_get_base_ptr(skb, tm->from_layer) - skb->data;
+	from += tm->from_offset;
+	to = tcf_get_base_ptr(skb, tm->to_layer) - skb->data;
+	to += tm->to_offset;
+
+	if (skb_find_text(skb, from, to, tm->config, &state) < 0)
+		return 0;
+
+	return 1;
+}
+
+static int em_text_change(struct tcf_proto *tp, void *data, int len,
+			  struct tcf_ematch *m)
+{
+	struct text_match *tm;
+	struct tcf_em_text *conf = data;
+	struct ts_config *ts_conf;
+
+	if (len < sizeof(*conf) ||
+	    len < (sizeof(*conf) + conf->pattern_len))
+		return -EINVAL;
+
+	if (conf->from_layer > conf->to_layer)
+		return -EINVAL;
+
+	if (conf->from_layer == conf->to_layer &&
+	    conf->from_offset > conf->to_offset)
+		return -EINVAL;
+
+	ts_conf = textsearch_prepare(conf->algo, (char *) conf + sizeof(*conf),
+				     conf->pattern_len, GFP_KERNEL, 1);
+	if (IS_ERR(ts_conf))
+		return PTR_ERR(ts_conf);
+
+	tm = kmalloc(sizeof(*tm), GFP_KERNEL);
+	if (tm == NULL) {
+		textsearch_destroy(ts_conf);
+		return -ENOBUFS;
+	}
+
+	tm->from_offset = conf->from_offset;
+	tm->to_offset   = conf->to_offset;
+	tm->from_layer  = conf->from_layer;
+	tm->to_layer    = conf->to_layer;
+	tm->config      = ts_conf;
+
+	m->datalen = sizeof(*tm);
+	m->data = (unsigned long) tm;
+
+	return 0;
+}
+
+static void em_text_destroy(struct tcf_proto *tp, struct tcf_ematch *m)
+{
+	struct text_match *tm = (struct text_match *) m->data;
+	textsearch_destroy(tm->config);
+}
+
+static int em_text_dump(struct sk_buff *skb, struct tcf_ematch *m)
+{
+	struct text_match *tm = (struct text_match *) m->data;
+	struct tcf_em_text conf;
+
+	strncpy(conf.algo, tm->config->ops->name, sizeof(conf.algo) - 1);
+	conf.from_offset = tm->from_offset;
+	conf.to_offset = tm->to_offset;
+	conf.from_layer = tm->from_layer;
+	conf.to_layer = tm->to_layer;
+	conf.pattern_len = textsearch_get_pattern_len(tm->config);
+	conf.pad = 0;
+
+	RTA_PUT_NOHDR(skb, sizeof(conf), &conf);
+	RTA_PUT_NOHDR(skb, conf.pattern_len,
+		      textsearch_get_pattern(tm->config));
+	return 0;
+
+rtattr_failure:
+	return -1;
+}		
+
+static struct tcf_ematch_ops em_text_ops = {
+	.kind	  = TCF_EM_TEXT,
+	.change	  = em_text_change,
+	.match	  = em_text_match,
+	.destroy  = em_text_destroy,
+	.dump	  = em_text_dump,
+	.owner	  = THIS_MODULE,
+	.link	  = LIST_HEAD_INIT(em_text_ops.link)
+};
+
+static int __init init_em_text(void)
+{
+	return tcf_em_register(&em_text_ops);
+}
+
+static void __exit exit_em_text(void) 
+{
+	tcf_em_unregister(&em_text_ops);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(init_em_text);
+module_exit(exit_em_text);

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
                   ` (4 preceding siblings ...)
  2005-05-27 22:49 ` [PATCH 5/5] [PKT_SCHED] textsearch ematch Thomas Graf
@ 2005-05-28 11:59 ` jamal
  2005-05-28 12:35   ` Thomas Graf
  2005-05-31 21:56 ` David S. Miller
  6 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2005-05-28 11:59 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev

On Sat, 2005-28-05 at 00:47 +0200, Thomas Graf wrote:
> Dave, I'm going through another RFC round to hopefully get
> some comments on the notes below.
> 
> I'm not yet satistifed with the core infrastructure wrt to
> non-linear data, especially for complex cases such as non
> linear skbs. The current way can be easly described as:
> 
> search_occurrance()
>   search-algorithm()
>     while get-next-segment()
>       process-segment()
> 

I dont have  opinions on this since you and Pablo seem to agree on it.
What I remember is that libssearch (or whatever that thing Harald was
looking at) did it differently (callback invoked and it return a code
which said to continue or not).
Also the design should (I think you do already, just double checking) -
should be wary of optimizing for a specific algorithmn; i see you have
KMP but not Boyer-Moore for example and i am not sure what the
repurcassions of above approach are etc etc.

> So basically we save the state of the segment fetching
> instead of saving the state of the search algorithm which
> would be way too complex for things like regular expression
> string matching.
> 

Can you explain this a little more. Didnt quiet understand.
If you are going across skbs, dont you need to save state?

> In the case of non linear skbs this would lead to:
> 

I think the best approach would be to first linearize then search. 
On the egress side this would mean annoying things like no TSO or what
was posted yesterday USO etc. Same thing with ingress, but i suspect
other than the netiron NIC we probably wont be seeing much of non-linear
skbs on the ingress. 
On ingress as well you are also better off to assemble fragments first
on that side before doing text searches. 
Perhaps implementing an action like the one used by openbsd to
"normalize" packets before a string match.
i.e 
classifier (pkt header) --> normalize --> classifier(string match)-> ?

Infact one could argue that if you are to scan a virus you may need
to assemble more than one skb on ingress (essentially a message).

The only other comment i have is on the patch you called naive regexp;
I think you should probably call it something else instead of regexp
since you invented it.

cheers,
jamal

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-28 11:59 ` [RFC] textsearch infrastructure et al v2 jamal
@ 2005-05-28 12:35   ` Thomas Graf
  2005-05-28 12:56     ` Pablo Neira
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Graf @ 2005-05-28 12:35 UTC (permalink / raw)
  To: jamal; +Cc: netdev

* jamal <1117281581.6251.68.camel@localhost.localdomain> 2005-05-28 07:59
> I dont have  opinions on this since you and Pablo seem to agree on it.
> What I remember is that libssearch (or whatever that thing Harald was
> looking at) did it differently (callback invoked and it return a code
> which said to continue or not).

Same for my infrastructure with the difference that libqsearch uses
a single input buffer so no chance for non-linear data.

> Also the design should (I think you do already, just double checking) -
> should be wary of optimizing for a specific algorithmn; i see you have
> KMP but not Boyer-Moore for example and i am not sure what the
> repurcassions of above approach are etc etc.

This is a weakness of the current implementation, it could be
addresses via the other method that I proposed. The current patch
doesn't allow for random access over fragment borders which means
that Boyer-Moore would require a temporary buffer with a size of
the pattern length in order to do random access over multiple
fragments. However, I haven't seen any infrastructure yet that can
handle non-linear data with bm wihout a complete prefetch in the
beginning though. You could still implement a bm by either using a
naive search at the borders or simply define the limitation that you
can only look at the first fragment so you'd have to linearize.

> > So basically we save the state of the segment fetching
> > instead of saving the state of the search algorithm which
> > would be way too complex for things like regular expression
> > string matching.
> > 
> 
> Can you explain this a little more. Didnt quiet understand.
> If you are going across skbs, dont you need to save state?

Most people look at it from the angle of iterate over text segments
and apply a text search algorithm. I inverted this and said,
start the text search algorithm and iterate over all text
segments inside the algorithm. This makes many things a lot
easier because we only have to save the state of the iterator.
This brings up another limitation though, we cannot stop a search
in progress and continue later on.

> I think the best approach would be to first linearize then search. 

This is what I'm trying to avoid. ;->

> On ingress as well you are also better off to assemble fragments first
> on that side before doing text searches. 

We can still do this optionally, just because we support it doesn't
mean we have to use it.

> Perhaps implementing an action like the one used by openbsd to
> "normalize" packets before a string match.
> i.e 
> classifier (pkt header) --> normalize --> classifier(string match)-> ?

> Infact one could argue that if you are to scan a virus you may need
> to assemble more than one skb on ingress (essentially a message).

http://oss.sgi.com/archives/netdev/2005-05/msg00298.html

See the section after <<coffee break>>

> The only other comment i have is on the patch you called naive regexp;
> I think you should probably call it something else instead of regexp
> since you invented it.

I was never good at names, any suggestions?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-28 12:35   ` Thomas Graf
@ 2005-05-28 12:56     ` Pablo Neira
  2005-05-28 12:58       ` Pablo Neira
                         ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Pablo Neira @ 2005-05-28 12:56 UTC (permalink / raw)
  To: Thomas Graf; +Cc: jamal, netdev

Thomas Graf wrote:
> * jamal <1117281581.6251.68.camel@localhost.localdomain> 2005-05-28 07:59
> 
>>I dont have  opinions on this since you and Pablo seem to agree on it.

to be frank, i'm still willing to propose some changes to Thomas, I'll 
do soon since he's pulled my ear with this second rfc request ;).

>>What I remember is that libssearch (or whatever that thing Harald was
>>looking at) did it differently (callback invoked and it return a code
>>which said to continue or not).
> 
> 
> Same for my infrastructure with the difference that libqsearch uses
> a single input buffer so no chance for non-linear data.

hm, i don't understand quite well, i bet that libqsearch was already 
fragment-aware. Anyway the main difference is that libqsearch wasn't 
designed to be used in user space so, for example, it needed a complete 
rework to reduce dynamic memory allocations.

>>Also the design should (I think you do already, just double checking) -
>>should be wary of optimizing for a specific algorithmn; i see you have
>>KMP but not Boyer-Moore for example and i am not sure what the
>>repurcassions of above approach are etc etc.
> 
> 
> This is a weakness of the current implementation, it could be
> addresses via the other method that I proposed. The current patch
> doesn't allow for random access over fragment borders which means
> that Boyer-Moore would require a temporary buffer with a size of
> the pattern length in order to do random access over multiple
> fragments. However, I haven't seen any infrastructure yet that can
> handle non-linear data with bm wihout a complete prefetch in the
> beginning though. You could still implement a bm by either using a
> naive search at the borders or simply define the limitation that you
> can only look at the first fragment so you'd have to linearize.

For small pattern and long packets, such pattern searching on the 
fragment borders doesn't really hurt the performance. Anyway the matter 
of having several algorithms will let users choose that one that suits 
better their needs.

boyer-moore (BM) and other variants are definitely a must to have. I'm 
still reading some papers about string matching and practical 
applications (p2p traffic recognition based on string matching, ids, etc 
etc) and the most interesting practical results come always from the use 
of BM friends.

>>I think the best approach would be to first linearize then search. 
> 
> 
> This is what I'm trying to avoid. ;->

sure, agree with Thomas.

Netfilter used to follow this approach in early 2.6 kernels and Patrick 
McHardy demostrated with some oprofile stuff that skb_copy_bits 
decreased performance.

I'm not familiar with those problems jamal has mentioned though, could 
they really affect the string matching infrastructure in some way? I 
truly prefer avoiding linearizing skb's.

>>The only other comment i have is on the patch you called naive regexp;
>>I think you should probably call it something else instead of regexp
>>since you invented it.
> 
> 
> I was never good at names, any suggestions?

nor me, i'd propose flamenco-dancing-regexp ;->.

--
Pablo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-28 12:56     ` Pablo Neira
@ 2005-05-28 12:58       ` Pablo Neira
  2005-05-28 12:58       ` Pablo Neira
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 18+ messages in thread
From: Pablo Neira @ 2005-05-28 12:58 UTC (permalink / raw)
  To: Pablo Neira; +Cc: Thomas Graf, jamal, netdev

Pablo Neira wrote:
>> Same for my infrastructure with the difference that libqsearch uses
>> a single input buffer so no chance for non-linear data.
> 
> 
> hm, i don't understand quite well, i bet that libqsearch was already 
> fragment-aware. Anyway the main difference is that libqsearch wasn't 
> designed to be used in user space so, for example, it needed a complete 
> rework to reduce dynamic memory allocations.

sorry, i meant '/s/wasn't/was/g'. libqsearch was designed to be used my 
snort.

--
Pablo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-28 12:56     ` Pablo Neira
  2005-05-28 12:58       ` Pablo Neira
@ 2005-05-28 12:58       ` Pablo Neira
  2005-05-28 13:58       ` Thomas Graf
  2005-05-31 22:05       ` David S. Miller
  3 siblings, 0 replies; 18+ messages in thread
From: Pablo Neira @ 2005-05-28 12:58 UTC (permalink / raw)
  To: Pablo Neira; +Cc: Thomas Graf, jamal, netdev

Pablo Neira wrote:
>> Same for my infrastructure with the difference that libqsearch uses
>> a single input buffer so no chance for non-linear data.
> 
> 
> hm, i don't understand quite well, i bet that libqsearch was already 
> fragment-aware. Anyway the main difference is that libqsearch wasn't 
> designed to be used in user space so, for example, it needed a complete 
> rework to reduce dynamic memory allocations.

sorry, i meant '/s/wasn't/was/g'. libqsearch was designed to be used by 
snort.

--
Pablo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-28 12:56     ` Pablo Neira
  2005-05-28 12:58       ` Pablo Neira
  2005-05-28 12:58       ` Pablo Neira
@ 2005-05-28 13:58       ` Thomas Graf
  2005-05-31 22:05       ` David S. Miller
  3 siblings, 0 replies; 18+ messages in thread
From: Thomas Graf @ 2005-05-28 13:58 UTC (permalink / raw)
  To: Pablo Neira; +Cc: jamal, netdev

* Pablo Neira <42986A85.9060001@eurodev.net> 2005-05-28 14:56
> hm, i don't understand quite well, i bet that libqsearch was already 
> fragment-aware.

I'll rephrase, libqsearch, according to my understanding, is not able
to do optimized scanning over borders, i.e. it uses the same naive
approach as your proposal. This isn't necessarly a weakness, it
heavly depends on the length of the pattern and the amount of
fragments.

One of the major differences between libqsearch and the textsearch
I propose is that libqsearch implements "jokers", a subset of
what ts_regexp.c does. This heavly complicates their reference
implementation of bm. Again, this isn't necessarly a weakness, since
the framework doesn't require the algorithm to actually implement
these.

I cannot really tell which of the proposals we have right now
is the best/fastest/cleanest/fanciest/whatever. For that reason
I decided to transfer as much responsibility as possibly away
from the core into the actual algorithms, respectively made
it configureable via callback functions. It is still possible
to write a ts_(kmp|bm)_state.c which saves the state of the
scanning progress but other than in libqsearch this is not
a requirement.

> For small pattern and long packets, such pattern searching on the 
> fragment borders doesn't really hurt the performance.

Exactly.

> boyer-moore (BM) and other variants are definitely a must to have. I'm 
> still reading some papers about string matching and practical 
> applications (p2p traffic recognition based on string matching, ids, etc 
> etc) and the most interesting practical results come always from the use 
> of BM friends.

Absolultely, I implemented KMP because it doesn't require random
access to the text and thus serves well as a reference implementation.
I'll be glad to see your BM ported.

> I'm not familiar with those problems jamal has mentioned though, could 
> they really affect the string matching infrastructure in some way? I 
> truly prefer avoiding linearizing skb's.

I think we have to differ between non-linear skbs which we should have
without linearization and the scanning through real fragments.

In order to make some progress we have to answer these questions:

 * Do we want to be able to search on non-linear data in general?
   I say: yes

 * Do we want to be able to search on non-linear data which is
   not completely available and/or present at the time the search
   starts? In other words, do we want a search to be interruptable
   to continue later on. e.g. search over multiple skbs without
   queueing them up.
   I say: no

 * Do we want to provide random access to the text to be searched?
   If so, optionally or as a requirement? This merely has impact
   on a) BM could be efficiently implemented w/o naive scanning
   around the borders and b) requires the text segments to be
   completely maped/prepare at the risk they'll never be used. [0]
   I say: probably no

 [0] This impact can be limited by having the user specify strict
     searching area limits.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-28 12:56     ` Pablo Neira
                         ` (2 preceding siblings ...)
  2005-05-28 13:58       ` Thomas Graf
@ 2005-05-31 22:05       ` David S. Miller
  3 siblings, 0 replies; 18+ messages in thread
From: David S. Miller @ 2005-05-31 22:05 UTC (permalink / raw)
  To: pablo; +Cc: tgraf, hadi, netdev

From: Pablo Neira <pablo@eurodev.net>
Date: Sat, 28 May 2005 14:56:37 +0200

> Netfilter used to follow this approach in early 2.6 kernels and Patrick 
> McHardy demostrated with some oprofile stuff that skb_copy_bits 
> decreased performance.

This case got converted into what skb_copy_bits() was probably
meant to be, skb_header_pointer().

The idea is, if it's linear in the SKB already (headers almost
certainly are) just pass back the pointer to it, else copy into
the user provided temporary buffer and return a pointer to that.

The text search stuff could easily do the same thing, using
a 32-byte or so sliding window to run the text search on.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
                   ` (5 preceding siblings ...)
  2005-05-28 11:59 ` [RFC] textsearch infrastructure et al v2 jamal
@ 2005-05-31 21:56 ` David S. Miller
  2005-05-31 22:44   ` Thomas Graf
  6 siblings, 1 reply; 18+ messages in thread
From: David S. Miller @ 2005-05-31 21:56 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, hadi

From: Thomas Graf <tgraf@suug.ch>
Date: Sat, 28 May 2005 00:47:25 +0200

> Any other ideas around?

You could just fetch "windows" of data.

You can define this window to be 32 bytes, or whatever.

Then you use skb_copy_data() into this 32 byte window scratch
pad the text searcher uses, and you're done.  It starts looking
like an MPEG stream parser :-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-31 21:56 ` David S. Miller
@ 2005-05-31 22:44   ` Thomas Graf
  2005-05-31 22:50     ` David S. Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Thomas Graf @ 2005-05-31 22:44 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, hadi

* David S. Miller <20050531.145627.85412348.davem@davemloft.net> 2005-05-31 14:56
> From: Thomas Graf <tgraf@suug.ch>
> Date: Sat, 28 May 2005 00:47:25 +0200
> 
> > Any other ideas around?
> 
> You could just fetch "windows" of data.
> 
> You can define this window to be 32 bytes, or whatever.

Heh, I had something like this in mind. Well, basically the
current behaviour is not different except that the window
is variable and uses the page data directly rather than
copying.

Back on the static window subject, it would definitely
be helpful for right-to-left scan algorithms such as
boyer-moore. However, I think that the overhead due to
the massive map/unmap and copying is bigger than the
costs of naive searches around the fragment borders.

Pablo joined me on the subject, he's currently working
on converting the fragmentation iteration to use Rusty's
skb_iter code. we'll present new work with some numbers
shortly.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC] textsearch infrastructure et al v2
  2005-05-31 22:44   ` Thomas Graf
@ 2005-05-31 22:50     ` David S. Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David S. Miller @ 2005-05-31 22:50 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, hadi

From: Thomas Graf <tgraf@suug.ch>
Date: Wed, 1 Jun 2005 00:44:39 +0200

> Pablo joined me on the subject, he's currently working
> on converting the fragmentation iteration to use Rusty's
> skb_iter code. we'll present new work with some numbers
> shortly.

Sounds good.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2005-05-31 22:50 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-05-27 22:47 [RFC] textsearch infrastructure et al v2 Thomas Graf
2005-05-27 22:47 ` [PATCH 1/5] [LIB] textsearch infrastructure Thomas Graf
2005-05-27 22:48 ` [PATCH 2/5] [LIB] Knuth-Morris-Pratt string-matching algorithm Thomas Graf
2005-05-27 22:48 ` [PATCH 3/5] [LIB] Naive regular expression " Thomas Graf
2005-05-27 22:48 ` [PATCH 4/5] [NET] Add skb_find_text() to search for a text pattern in skbs Thomas Graf
2005-05-28  3:11   ` Pablo Neira
2005-05-28 11:32     ` Thomas Graf
2005-05-27 22:49 ` [PATCH 5/5] [PKT_SCHED] textsearch ematch Thomas Graf
2005-05-28 11:59 ` [RFC] textsearch infrastructure et al v2 jamal
2005-05-28 12:35   ` Thomas Graf
2005-05-28 12:56     ` Pablo Neira
2005-05-28 12:58       ` Pablo Neira
2005-05-28 12:58       ` Pablo Neira
2005-05-28 13:58       ` Thomas Graf
2005-05-31 22:05       ` David S. Miller
2005-05-31 21:56 ` David S. Miller
2005-05-31 22:44   ` Thomas Graf
2005-05-31 22:50     ` David S. Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).