[RFC] textsearch infrastructure + skb_find

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] textsearch infrastructure + skb_find_text()
@ 2005-05-04 23:40 Thomas Graf
  2005-05-05 12:42 ` jamal
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Thomas Graf @ 2005-05-04 23:40 UTC (permalink / raw)
  To: netdev; +Cc: Pablo Neira

The patch below is a report on the current state of the textsearch
infrastructure and its first user skb_find_text(). The textsearch
is kept as simple as possible but advanced enough to handle non-linear
data such as skb fragments. Unlike in many other approaches the text
input is not seen as a single pointer but rather as a continuously
called callback get_text() until 0 is returned allowing to search
on any kind of data and to implement customized from-to limits.

The patch is separated into 3 parts, the first one being the textsearch
infrastructure itself followed by a simple Knuth-Morris-Pratt
implementation for reference. I'm also working on what could be called
the smallest regular expression implementation ever but I left that
out for now since it still has issues. Last but not least the
function skb_find_text() written in a hurry and probably not yet
correct but you should get the idea. From a userspace perspective
the first user will be an ematch but writing it will be peanuts
so I left it out for now.

Basically what it looks like right now is:

int pos;
struct ts_state;
struct ts_config *conf = textsearch_prepare("kmp", "hanky", 5, GFP_KERNEL, 1);

/* search for "hanky" at offset 20 until end of packet */
for (pos = skb_find_text(skb, 20, INT_MAX, conf, &state;
     pos >= 0;
     pos = textsearch_next(conf, &state)) {
        printk("Need a hanky? I found one at offset %d.\n", pos);
}

textsearch_put(conf);
kfree(conf);

You might wonder about the 1 given to _prepare(),  it indicates whether
to autoload modules because the ematches will need it to be able to drop
rtnl sem.

The code is not tested and cerainly not bug free yet but should compile.

Thoughts?

diff -X dontdiff -Nru linux-2.6.12-rc3.orig/include/linux/textsearch.h linux-2.6.12-rc3/include/linux/textsearch.h
--- linux-2.6.12-rc3.orig/include/linux/textsearch.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3/include/linux/textsearch.h	2005-05-05 00:35:07.000000000 +0200
@@ -0,0 +1,169 @@
+#ifndef __LINUX_TEXTSEARCH_H
+#define __LINUX_TEXTSEARCH_H
+
+#ifdef __KERNEL__
+
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/err.h>
+
+struct ts_config;
+
+/**
+ * struct ts_state - textsearch state
+ * @offset: current offset for next match
+ * @args: for persistant variables of get_text()
+ */
+struct ts_state
+{
+	int			offset;
+	long			args[6];
+};
+
+/**
+ * struct ts_ops - textsearch operations
+ * @name: name of search algorithm
+ * @init: called upon initialization to prepare a search
+ * @find: does the actual matching on a prepared configuration
+ * @owner: reference to algorithm module if existent
+ * @list: operations list we are on
+ */
+struct ts_ops
+{
+	const char		*name;
+	struct ts_config *	(*init)(const unsigned char *, size_t, int);
+	int			(*find)(struct ts_config *,
+					struct ts_state *);
+	struct module		*owner;
+	struct list_head	list;
+};
+
+/**
+ * struct ts_config - textsearch configuration
+ * @ops: textsearch operations
+ * @get_text: callback to fetch text to search in
+ * @len: length of pattern
+ */
+struct ts_config
+{
+	struct ts_ops		*ops;
+
+	/**
+	 * get_text - return next chunk of text
+	 * @offset: Number of bytes consumed by the matcher
+	 * @dst: destination buffer
+	 * @conf: search configuration
+	 * @state: search state
+	 *
+	 * Gets called repeatedly until 0 is returned.Must assign a pointer
+	 * to the start of the next chunk of text to *dst and return the
+	 * length of the chunk or 0 if at the end. offset==0 indicates
+	 * a new search. May store/read persistant values in state->args[].
+	 */
+	int			(*get_text)(int offset, unsigned char **dst,
+					    struct ts_config *conf,
+					    struct ts_state *state);
+
+	/**
+	 * finish - called when the matching has been completed successful
+	 *          or not.
+	 * @conf: search configuration
+	 * @state: search state
+	 */
+	void			(*finish)(struct ts_config *conf,
+					  struct ts_state *state);
+	int			pattern_len;
+};
+
+/* Do no use this function directly */
+static inline int __textsearch_find(struct ts_config *conf,
+				    struct ts_state *state)
+{
+	int ret = conf->ops->find(conf, state);
+
+	if (conf->finish)
+		conf->finish(conf, state);
+
+	return ret;
+}
+
+/**
+ * textsearch_find - search for a pattern in a text
+ * @conf: search configuration
+ * @state: search state
+ *
+ * Peforms the actual search on the prepared configuration.
+ *
+ * Returns the position of first occurrence of the pattern or a
+ *         negative number if no match was found.
+ */ 
+static inline int textsearch_find(struct ts_config *conf,
+				   struct ts_state *state)
+{
+	state->offset = 0;
+	return __textsearch_find(conf, state);
+}
+
+/**
+ * textsearch_next - continue search for a pattern in a text
+ * @conf: search configuration
+ * @state: search state
+ *
+ * Continues a search looking for more occurrences of the pattern.
+ * You must call textsearch_find() to search the first occurrence
+ * in order to reset the state.
+ *
+ * Returns the position of next occurance of the pattern or a
+ *         negative number if no match was found.
+ */ 
+static inline int textsearch_next(struct ts_config *conf,
+				  struct ts_state *state)
+{
+	state->offset += conf->pattern_len;
+	return __textsearch_find(conf, state);
+}
+
+/**
+ * textsearch_put - give back a textsearch configuration
+ * @conf: search configuration
+ *
+ * Releases all references of the configuration. Must be
+ * called prior to freeing the object.
+ */
+static inline void textsearch_put(struct ts_config *conf)
+{
+	if (conf->ops)
+		module_put(conf->ops->owner);
+}
+
+/**
+ * alloc_ts_config - allocate a textsearch configuration
+ * @payload: size of additional module specific data required
+ * @gfp_mask: allocation mask
+ *
+ * Returns a new, empty textsearch configuration or a ERR_PTR().
+ */
+static inline struct ts_config *alloc_ts_config(size_t payload, int gfp_mask)
+{
+	struct ts_config *conf;
+
+	conf = kmalloc(sizeof(*conf) + payload, gfp_mask);
+	if (conf == NULL)
+		return ERR_PTR(-ENOMEM);
+
+	memset(conf, 0, sizeof(*conf) + payload);
+	return conf;
+}
+
+extern int textsearch_register(struct ts_ops *);
+extern int textsearch_unregister(struct ts_ops *);
+extern struct ts_config *textsearch_prepare(const char *, const unsigned char *,
+					    size_t, int, int);
+extern int textsearch_find_continuous(struct ts_config *, struct ts_state *,
+				      const unsigned char *, size_t);
+
+#endif /* __KERNEL__ */
+
+#endif
diff -X dontdiff -Nru linux-2.6.12-rc3.orig/lib/textsearch.c linux-2.6.12-rc3/lib/textsearch.c
--- linux-2.6.12-rc3.orig/lib/textsearch.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3/lib/textsearch.c	2005-05-04 23:01:42.000000000 +0200
@@ -0,0 +1,241 @@
+/*
+ * lib/textsearch.c	Generic text search interface
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ *
+ * ==========================================================================
+ *
+ * The textsearch infrastructure provides text searching facitilies for both
+ * linear and non-linear data. Unlike many other 
+ *
+ * Before a textsearch can be performed, a configuration must be created
+ * by calling textsearch_prepare() specyfing the searching algorithm and
+ * the pattern to match. The returned configuration may then be used for
+ * an arbitary amount of times and even in parallel as long as you
+ * provide a separate struct ts_state variable for every instance.
+ *
+ * The actual search is performed by either calling textsearch_find_text()
+ * for linear text or by providing an own get_text() implementation and
+ * calling textsearch_find(). Both functions will returns the position
+ * of the first matching occurrence of the patern. Subsequent matches
+ * may be retrived via textsearch_next() irrespective of the linearity
+ * of the text.
+ *
+ * Once you're done using a configuration you must give back the
+ * allocted resources by calling textsearch_put().
+ *
+ * Example:
+ *   int pos;
+ *   struct ts_config *conf;
+ *   struct ts_state state;
+ *   conar char *pattern = "chicken";
+ *   const char *example = "We dance the funky chicken";
+ *
+ *   conf = textsearch_prepare("kmp", pattern, strlen(pattern), GFP_KERNEL, 1);
+ *   if (IS_ERR(conf)) {
+ *       err = PTR_ERR(conf);
+ *       goto errout;
+ *   }
+ *
+ *   pos = textsearch_find_text(conf, &state, example, strlen(example));
+ *   if (pos >= 0)
+ *       panic("Oh my god, dancing chickens at %d\n", pos);
+ *
+ *   textsearch_put(conf);
+ *   kfree(conf);
+ *
+ * ==========================================================================
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/err.h>
+#include <linux/textsearch.h>
+
+static LIST_HEAD(ts_ops);
+static DEFINE_RWLOCK(ts_mod_lock);
+
+static inline struct ts_ops *lookup_ts_algo(const char *name)
+{
+	struct ts_ops *o;
+
+	read_lock(&ts_mod_lock);
+	list_for_each_entry(o, &ts_ops, list) {
+		if (!strcmp(name, o->name)) {
+			if (!try_module_get(o->owner))
+				o = NULL;
+			read_unlock(&ts_mod_lock);
+			return o;
+		}
+	}
+	read_unlock(&ts_mod_lock);
+
+	return NULL;
+}
+
+/**
+ * textsearch_register - register a textsearch module
+ * @ops: operations lookup table
+ *
+ * This function must be called by textsearch modules to announce
+ * their presence. The given @ops must have %name set to a unique
+ * identifier and the callbacks find() and init() must be implemented.
+ *
+ * Returns -EEXISTS if another module has already registered with
+ *         same name.
+ */
+int textsearch_register(struct ts_ops *ops)
+{
+	int err = -EEXIST;
+	struct ts_ops *o;
+
+	if (ops->name == NULL || ops->find == NULL || ops->init == NULL)
+		return -EINVAL;
+
+	write_lock(&ts_mod_lock);
+	list_for_each_entry(o, &ts_ops, list)
+		if (!strcmp(ops->name, o->name))
+			goto errout;
+
+	list_add_tail(&ops->list, &ts_ops);
+	err = 0;
+errout:
+	write_unlock(&ts_mod_lock);
+	return err;
+}
+
+/**
+ * textsearch_unregister - unregister a textsearch module
+ * @ops: operations lookup table
+ *
+ * This function must be called by textsearch modules to announce
+ * their disappearance for examples when the module gets unloaded.
+ * The @ops parameter must be the same as the one during the
+ * registration.
+ *
+ * Returns -ENOENT if no matching textsearch registration was found.
+ */
+int textsearch_unregister(struct ts_ops *ops)
+{
+	int err = 0;
+	struct ts_ops *o;
+
+	write_lock(&ts_mod_lock);
+	list_for_each_entry(o, &ts_ops, list) {
+		if (o == ops) {
+			list_del(&o->list);
+			goto out;
+		}
+	}
+
+	err = -ENOENT;
+out:
+	write_unlock(&ts_mod_lock);
+	return err;
+}
+
+static inline int get_linear_text(int offset, unsigned char **text,
+				  struct ts_config *conf,
+				  struct ts_state *state)
+{
+	unsigned char *string = (unsigned char *) state->args[0];
+	size_t len = state->args[1];
+
+	if (offset < len) {
+		*text = string + offset;
+		return len - offset;
+	} else
+		return 0;
+}
+
+/**
+ * textsearch_find_continuous - search a pattern in continuous/linear text
+ * @conf: search configuration
+ * @state: search state
+ * @text: text to search in
+ * @len: length of text
+ *
+ * A simplified version of textsearch_find() for continuous/linear text.
+ * Call textsearch_next() to retrieve subsequent matches.
+ *
+ * Returns the position of first occurrence of the pattern or a
+ *         negative number if no match was found.
+ */ 
+int textsearch_find_continuous(struct ts_config *conf, struct ts_state *state,
+				const unsigned char *text, size_t len)
+{
+	conf->get_text = get_linear_text;
+	state->args[0] = (long) text;
+	state->args[1] = len;
+
+	return textsearch_find(conf, state);
+}
+
+/**
+ * textsearch_prepare - Prepare a text search
+ * @algo: name of search algorithm
+ * @pattern: pattern data
+ * @len: length of pattern
+ * @gfp_mask: allocation mask
+ * @autoload: bool indicating whether to autoload modules
+ *
+ * Looks up the search algorithm module and creates a new textsearch
+ * configuration for the specified pattern. Upon completion all
+ * necessary refcnts are held and the configuration must be put back
+ * using textsearch_put() after usage.
+ *
+ * Note: The format of the pattern may not be compatible between
+ *       the various search algorithms.
+ *
+ * Returns a new textsearch configuration according to the specified
+ *         parameters or a ERR_PTR().
+ */
+struct ts_config *textsearch_prepare(const char *algo,
+				     const unsigned char *pattern, size_t len,
+				     int gfp_mask, int autoload)
+{
+	int err = -ENOENT;
+	struct ts_config *conf;
+	struct ts_ops *ops;
+	
+	ops = lookup_ts_algo(algo);
+#ifdef CONFIG_KMOD
+	/* Why not always autoload you may ask. Some users may be
+	 * in a situation where requesting a module may deadlock,
+	 * especially when the module is located on a NFS mount. */
+	if (ops == NULL && autoload) {
+		request_module("ts_%s", algo);
+		ops = lookup_ts_algo(algo);
+	}
+#endif
+
+	if (ops == NULL)
+		goto errout;
+
+	conf = ops->init(pattern, len, gfp_mask);
+	if (IS_ERR(conf)) {
+		err = PTR_ERR(conf);
+		goto errout;
+	}
+
+	conf->ops = ops;
+	return conf;
+
+errout:
+	if (ops)
+		module_put(ops->owner);
+		
+	return ERR_PTR(err);
+}
+
+EXPORT_SYMBOL(textsearch_register);
+EXPORT_SYMBOL(textsearch_unregister);
+EXPORT_SYMBOL(textsearch_prepare);
+EXPORT_SYMBOL(textsearch_find_continuous);
diff -X dontdiff -Nru linux-2.6.12-rc3.orig/lib/ts_kmp.c linux-2.6.12-rc3/lib/ts_kmp.c
--- linux-2.6.12-rc3.orig/lib/ts_kmp.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3/lib/ts_kmp.c	2005-05-05 00:32:28.000000000 +0200
@@ -0,0 +1,108 @@
+/*
+ * lib/ts_kmp.c		Knuth-Morris-Pratt text search implementation
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/textsearch.h>
+
+#define TS_KMP_PREFIX_TBL(conf) \
+	((unsigned int *) ((unsigned char *) (conf) \
+			   + sizeof(struct ts_config)))
+
+#define TS_KMP_PATTERN(conf) \
+	((unsigned char *) TS_KMP_PREFIX_TBL(conf) \
+			   + (conf->pattern_len * sizeof(unsigned int)))
+
+static int kmp_find(struct ts_config *conf, struct ts_state *state)
+{
+	int i, q = 0, consumed = state->offset;
+	unsigned char *text, *pattern = TS_KMP_PATTERN(conf);
+	unsigned int *prefix_tbl = TS_KMP_PREFIX_TBL(conf);
+	size_t text_len, pattern_len = conf->pattern_len;
+
+	for (;;) {
+		text_len = conf->get_text(consumed, &text, conf, state);
+
+		if (text_len == 0)
+			break;
+
+		for (i = 0; i < text_len; i++) {
+			while (q > 0 && pattern[q] != text[i])
+				q = prefix_tbl[q - 1];
+			if (pattern[q] == text[i])
+				q++;
+			if (q == pattern_len) {
+				state->offset = consumed + i - pattern_len + 1;
+				return state->offset;
+			}
+		}
+
+		consumed += text_len;
+	}
+
+	return -1;
+
+}
+
+static inline void compute_prefix_tbl(const unsigned char *pattern, size_t len,
+				      unsigned int *prefix_tbl)
+{
+	unsigned int k, q;
+
+	for (k = 0, q = 1; q < len; q++) {
+		while (k > 0 && pattern[k] != pattern[q])
+			k = prefix_tbl[k-1];
+		if (pattern[k] == pattern[q])
+			k++;
+		prefix_tbl[q] = k;
+	}
+}
+
+static struct ts_config *kmp_init(const unsigned char *pattern, size_t len,
+				  int gfp_mask)
+{
+	struct ts_config *conf;
+
+	conf = alloc_ts_config(len + (len * sizeof(int)), gfp_mask);
+	if (IS_ERR(conf))
+		return conf;
+
+	conf->pattern_len = len;
+	memcpy(TS_KMP_PATTERN(conf), pattern, len);
+	compute_prefix_tbl(pattern, len, TS_KMP_PREFIX_TBL(conf));
+
+	return conf;
+}
+
+static struct ts_ops kmp_ops = {
+	.name	= "kmp",
+	.find	= kmp_find,
+	.init	= kmp_init,
+	.owner	= THIS_MODULE,
+	.list	= LIST_HEAD_INIT(kmp_ops.list)
+};
+
+static int __init init_kmp(void)
+{
+	return textsearch_register(&kmp_ops);
+}
+
+static void __exit exit_kmp(void)
+{
+	textsearch_unregister(&kmp_ops);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(init_kmp);
+module_exit(exit_kmp);
diff -X dontdiff -Nru linux-2.6.12-rc3.orig/net/core/skbuff.c linux-2.6.12-rc3/net/core/skbuff.c
--- linux-2.6.12-rc3.orig/net/core/skbuff.c	2005-04-30 20:22:24.000000000 +0200
+++ linux-2.6.12-rc3/net/core/skbuff.c	2005-05-05 00:35:50.000000000 +0200
@@ -1502,6 +1502,80 @@
 		skb_split_no_header(skb, skb1, len, pos);
 }
 
+static int get_skb_text(int offset, unsigned char **text,
+			struct ts_config *conf, struct ts_state *state)
+{
+	/* args[0]: lower limit
+	 * args[1]: upper limit
+	 * args[2]: skb
+	 * args[3]: fragment index
+	 * args[4]: current fragment data buffer
+	 * args[5]: octets consumed up to previous fragment */
+	int from = state->args[0], to = state->args[1];
+	struct sk_buff *skb = (struct sk_buff *) state->args[2];
+	int limit = min_t(int, skb_headlen(skb), to);
+	int real_offset = offset + from;
+	skb_frag_t *f;
+
+	if (!skb_is_nonlinear(skb)) {
+		if (real_offset < limit) {
+linear:
+			*text = skb->data + real_offset;
+			return limit - real_offset;
+		}
+
+		return 0;
+	}
+
+	if (real_offset < limit)
+		goto linear;
+
+next_fragment:
+	f = &skb_shinfo(skb)->frags[state->args[3]];
+	limit = min_t(int, f->size + state->args[5], to);
+
+	if (!state->args[4])
+		state->args[4] = (long) kmap_skb_frag(f);
+
+	if (real_offset < limit) {
+		*text = (unsigned char *) state->args[4] + f->page_offset +
+			(real_offset - (int) state->args[5]);
+		return limit - real_offset;
+	}
+
+	kunmap_skb_frag((void *) state->args[4]);
+	state->args[3]++;
+	state->args[4] = (long) NULL;
+	state->args[5] += f->size;
+
+	if (state->args[3] >= skb_shinfo(skb)->nr_frags)
+		return 0;
+
+	goto next_fragment;
+}
+
+static void get_skb_text_finish(struct ts_config *conf, struct ts_state *state)
+{
+	if (state->args[4])
+		kunmap_skb_frag((void *) state->args[4]);
+}
+
+int skb_find_text(struct sk_buff *skb, int from, int to,
+		  struct ts_config *config, struct ts_state *state)
+{
+	config->get_text = get_skb_text;
+	config->finish = get_skb_text_finish;
+	state->args[0] = from;
+	state->args[1] = to;
+	state->args[2] = (long) skb;
+	state->args[3] = 0;
+	state->args[4] = 0;
+	state->args[5] = skb_headlen(skb);
+
+	return textsearch_find(config, state);
+}
+
+
 void __init skb_init(void)
 {
 	skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
@@ -1540,3 +1614,4 @@
 EXPORT_SYMBOL(skb_unlink);
 EXPORT_SYMBOL(skb_append);
 EXPORT_SYMBOL(skb_split);
+EXPORT_SYMBOL(skb_find_text);
diff -X dontdiff -Nru linux-2.6.12-rc3.orig/include/linux/skbuff.h linux-2.6.12-rc3/include/linux/skbuff.h
--- linux-2.6.12-rc3.orig/include/linux/skbuff.h	2005-04-30 20:22:19.000000000 +0200
+++ linux-2.6.12-rc3/include/linux/skbuff.h	2005-05-05 00:12:46.000000000 +0200
@@ -28,6 +28,7 @@
 #include <linux/poll.h>
 #include <linux/net.h>
 #include <net/checksum.h>
+#include <linux/textsearch.h>
 
 #define HAVE_ALLOC_SKB		/* For the drivers to know */
 #define HAVE_ALIGNABLE_SKB	/* Ditto 8)		   */
@@ -1186,6 +1187,9 @@
 extern void	       skb_split(struct sk_buff *skb,
 				 struct sk_buff *skb1, const u32 len);
 
+extern int skb_find_text(struct sk_buff *skb, int from, int to,
+			 struct ts_config *config, struct ts_state *state);
+
 static inline void *skb_header_pointer(const struct sk_buff *skb, int offset,
 				       int len, void *buffer)
 {

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-04 23:40 [RFC] textsearch infrastructure + skb_find_text() Thomas Graf
@ 2005-05-05 12:42 ` jamal
  2005-05-05 14:12   ` Thomas Graf
  2005-05-05 17:02 ` Pablo Neira
  2005-05-06 21:44 ` Thomas Graf
  2 siblings, 1 reply; 14+ messages in thread
From: jamal @ 2005-05-05 12:42 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, Pablo Neira

On Thu, 2005-05-05 at 01:40 +0200, Thomas Graf wrote:
> The patch below is a report on the current state of the textsearch
> infrastructure and its first user skb_find_text(). The textsearch
> is kept as simple as possible but advanced enough to handle non-linear
> data such as skb fragments. Unlike in many other approaches the text
> input is not seen as a single pointer but rather as a continuously
> called callback get_text() until 0 is returned allowing to search
> on any kind of data and to implement customized from-to limits.
> 

How is this different from libqsearch? IIRC, it also kept pointers and
callbacks.

BTW, I hope theres sync with libqsearch - at least some canibalization
of ideas.
Also hopefully, pluggin of ne algorithms is trivial (e.g boyer-moore
could be included in addition to kmp etc)

> The patch is separated into 3 parts, the first one being the textsearch
> infrastructure itself followed by a simple Knuth-Morris-Pratt
> implementation for reference. I'm also working on what could be called
> the smallest regular expression implementation ever but I left that
> out for now since it still has issues. Last but not least the
> function skb_find_text() written in a hurry and probably not yet
> correct but you should get the idea. From a userspace perspective
> the first user will be an ematch but writing it will be peanuts
> so I left it out for now.
> 

nice

> Basically what it looks like right now is:
> 
> int pos;
> struct ts_state;
> struct ts_config *conf = textsearch_prepare("kmp", "hanky", 5, GFP_KERNEL, 1);
>
> /* search for "hanky" at offset 20 until end of packet */
> for (pos = skb_find_text(skb, 20, INT_MAX, conf, &state;
>      pos >= 0;
>      pos = textsearch_next(conf, &state)) {
>         printk("Need a hanky? I found one at offset %d.\n", pos);
> }
> 

I have a lot of questions:
- does a string have to be terminated by \0?
- do you keep state of the string from the begining? ex: how do you know
that preceeding "hanky" was "Need a"?
- all sorts of limits: how long is the string? etc
- what happens if a string spans multiple skbs or even multiple
fragments?

> textsearch_put(conf);
> kfree(conf);
> 
> You might wonder about the 1 given to _prepare(),  it indicates whether
> to autoload modules because the ematches will need it to be able to drop
> rtnl sem.
> 

do you really wanna leave that decision upto the user?

> The code is not tested and cerainly not bug free yet but should compile.
> 
> Thoughts?

I dont have time to look at the patch to sufficiently critique it, but
it looks like a good start - maybe this weekend.
It would be nice to have other utilities which could be loaded eg; case
compare, regualr expressions, strchr after you match, etc
Of course all this to be followed by actions such as strok etc.
Probably all this is a layer above this - but essentially when you are
doing this keep the desire to do this in mind.

cheers,
jamal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-05 12:42 ` jamal
@ 2005-05-05 14:12   ` Thomas Graf
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2005-05-05 14:12 UTC (permalink / raw)
  To: jamal; +Cc: netdev, Pablo Neira

* jamal <1115296937.7680.52.camel@localhost.localdomain> 2005-05-05 08:42
> How is this different from libqsearch? IIRC, it also kept pointers and
> callbacks.

The main difference is that my infrastructure is much simpler.

> BTW, I hope theres sync with libqsearch - at least some canibalization
> of ideas.

I read the libqsearch code but fixing up all the issues required
for my use would have taken up more time than writing up something new.

> Also hopefully, pluggin of ne algorithms is trivial (e.g boyer-moore
> could be included in addition to kmp etc)

Very simple, ts_kmp.c is 108 lines whereas simple.c in libqsearch
is over 500.

> I have a lot of questions:
> - does a string have to be terminated by \0?

All strings are length terminated.

> - do you keep state of the string from the begining? ex: how do you know
> that preceeding "hanky" was "Need a"?

I store the shift of a match in struct ts_state, add the length of
the pattern in the call to textsearch_next() and provide this shift
as offset to the get_text() callback.

> - all sorts of limits: how long is the string? etc

Both the length of a pattern and the length of a text block is
limited by INT_MAX. However, since there can be an arbitary number
of blocks there is no limit in the text length. Depending on the
search algorithm there might be more limits, for example kmp uses
unsigned int for each prefix table entry so this limits the length
of the pattern as well (theoretically).

> - what happens if a string spans multiple skbs or even multiple
> fragments?

This was the single most important requirement. Actually you can
compose the text out of anything. I already wrote some code to
handle paged skbs and multiple skbs can be implemented the same
way. You could for example store a ts_state per conntrack and
call textsearch_next continuously until you find a match. Would
require some magic in the get_text() but shouldn't be too hard.

> 
> > You might wonder about the 1 given to _prepare(),  it indicates whether
> > to autoload modules because the ematches will need it to be able to drop
> > rtnl sem.
> > 
> 
> do you really wanna leave that decision upto the user?

We have to, otherwise we can't use it in ematches without the risk
of deadlocks. There is no way around than have the caller drop its
locks and call prepare again with no locks held, at least I'm not
aware of one.

> It would be nice to have other utilities which could be loaded eg; case
> compare, regualr expressions, strchr after you match, etc

Indeed, I'm working on the regular expression thing but it has
some issues with textsearch_next() for patterns like .+abc.+ which
I want to resolve first.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-04 23:40 [RFC] textsearch infrastructure + skb_find_text() Thomas Graf
  2005-05-05 12:42 ` jamal
@ 2005-05-05 17:02 ` Pablo Neira
  2005-05-05 17:42   ` Thomas Graf
  2005-05-06 21:44 ` Thomas Graf
  2 siblings, 1 reply; 14+ messages in thread
From: Pablo Neira @ 2005-05-05 17:02 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, jamal

Hi Thomas,

Thomas Graf wrote:
> The patch is separated into 3 parts, the first one being the textsearch
> infrastructure itself followed by a simple Knuth-Morris-Pratt
> implementation for reference. I'm also working on what could be called
> the smallest regular expression implementation ever but I left that
> out for now since it still has issues. Last but not least the
> function skb_find_text() written in a hurry and probably not yet
> correct but you should get the idea. From a userspace perspective
> the first user will be an ematch but writing it will be peanuts
> so I left it out for now.
> 
> Basically what it looks like right now is:
> 
> int pos;
> struct ts_state;
> struct ts_config *conf = textsearch_prepare("kmp", "hanky", 5, GFP_KERNEL, 1);
> 
> /* search for "hanky" at offset 20 until end of packet */
> for (pos = skb_find_text(skb, 20, INT_MAX, conf, &state;
>      pos >= 0;
>      pos = textsearch_next(conf, &state)) {
>         printk("Need a hanky? I found one at offset %d.\n", pos);
> }
> 
> textsearch_put(conf);
> kfree(conf);

I haven't got too much time to review this stuff in deep though. 
Impressive work, but I still miss something, some comments:

- A custom destroy function in ts_ops.
- I don't see a way to look for matches in fragments. I mean, say we've 
got "dancing " in a fragment and "chicken" in the next one. Currently we 
don't get a match.
- I particularly like Rusty's skb iterator, well you've refactored that 
code anyway.

I've been reworking the framework for string matching that I sent you 
two/three months ago, you've definitely worked on a good base. Since 
then I've introduced a lot of changes and actually I've been testing it 
(ick, that means that we've clashed!).

I think that I can merge both works and then roll on. I still need more 
time to study more in deep your proposition.

--
Pablo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-05 17:02 ` Pablo Neira
@ 2005-05-05 17:42   ` Thomas Graf
  2005-05-06  1:33     ` Pablo Neira
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2005-05-05 17:42 UTC (permalink / raw)
  To: Pablo Neira; +Cc: netdev, jamal

* Pablo Neira <427A51A2.8090600@eurodev.net> 2005-05-05 19:02
> - A custom destroy function in ts_ops.

Sure, can be added once we need it. Currently none of my
algorithms allocate anything on their own. Might be worth
to change the API and replace textsearch_put with textesearch_destroy
also freeing the configuration though.

> - I don't see a way to look for matches in fragments. I mean, say we've 
> got "dancing " in a fragment and "chicken" in the next one. Currently we 
> don't get a match.

This is implemented in skb_find_text(), I just looked through rusty's
code and it's very simliar so we could make the args in ts_state a
union of long args[6] and char cb[XXX] so we can store skb_iter
in there.

> I've been reworking the framework for string matching that I sent you 
> two/three months ago, you've definitely worked on a good base. Since 
> then I've introduced a lot of changes and actually I've been testing it 
> (ick, that means that we've clashed!).

I don't think so, my approach is quite different and allows for
optimized text searches over fragments, multiple skbs, etc without
any hacks around the fragment borders. It's a lot more generic
allowing it to be in lib/ so it can be used from other subsystem
as well.

> I think that I can merge both works and then roll on. I still need more 
> time to study more in deep your proposition.

Sure go ahead but please don't merge it into netfilter, a) the core
really is supposed to be in lib, and b) I don't want any netfilter
dependencies in my ematches.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-05 17:42   ` Thomas Graf
@ 2005-05-06  1:33     ` Pablo Neira
  2005-05-06 12:36       ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Pablo Neira @ 2005-05-06  1:33 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, jamal

Thomas Graf wrote:
> * Pablo Neira <427A51A2.8090600@eurodev.net> 2005-05-05 19:02
> 
>>- A custom destroy function in ts_ops.
> 
> 
> Sure, can be added once we need it. Currently none of my
> algorithms allocate anything on their own. Might be worth
> to change the API and replace textsearch_put with textesearch_destroy
> also freeing the configuration though.

that's fine

>>- I don't see a way to look for matches in fragments. I mean, say we've 
>>got "dancing " in a fragment and "chicken" in the next one. Currently we 
>>don't get a match.
> 
> 
> This is implemented in skb_find_text(), I just looked through rusty's
> code and it's very simliar so we could make the args in ts_state a
> union of long args[6] and char cb[XXX] so we can store skb_iter
> in there.

Yes, we could just use one of those args in ts_state to store if we 
found a partial match or not.

>>I've been reworking the framework for string matching that I sent you 
>>two/three months ago, you've definitely worked on a good base. Since 
>>then I've introduced a lot of changes and actually I've been testing it 
>>(ick, that means that we've clashed!).
> 
> 
> I don't think so, my approach is quite different and allows for
> optimized text searches over fragments, multiple skbs, etc without
> any hacks around the fragment borders.

Boyer-Moore requires such hack to look for matches around the fragment 
borders. But no problem, such hack will go inside the bm_find function.

> It's a lot more generic
> allowing it to be in lib/ so it can be used from other subsystem
> as well.

Sure, your approach is definitely more generic because I just thought 
about using such string matching framework for net apps, it's different 
from that point of view, but I still see things that reminds me to my 
proposed framework, say that ts_state thing ;).

>>I think that I can merge both works and then roll on. I still need more 
>>time to study more in deep your proposition.
> 
> 
> Sure go ahead but please don't merge it into netfilter, a) the core
> really is supposed to be in lib, and b) I don't want any netfilter
> dependencies in my ematches.

No problem mate. Just curious about something, since this stuff will 
live under lib, what kind of applications in kernel space could use this 
string matching framework ? I bet that non-net kernel guys will surely 
ask about it ;->

I'll be back working on this stuff next week, expect some feedback.

--
Pablo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-06  1:33     ` Pablo Neira
@ 2005-05-06 12:36       ` Thomas Graf
  2005-05-06 13:04         ` jamal
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2005-05-06 12:36 UTC (permalink / raw)
  To: Pablo Neira; +Cc: netdev, jamal

* Pablo Neira <427AC96E.2020208@eurodev.net> 2005-05-06 03:33
> >This is implemented in skb_find_text(), I just looked through rusty's
> >code and it's very simliar so we could make the args in ts_state a
> >union of long args[6] and char cb[XXX] so we can store skb_iter
> >in there.
> 
> Yes, we could just use one of those args in ts_state to store if we 
> found a partial match or not.

This is not required, I will elaborate the core logic to clarify this.
I guess you still think of the logic as:

function find-text(pattern)
  foreach text-segment do
    call search-bm(text-segment, pattern);
  done
end

This is what most people naturally think of when dealing with this
problem and it might be sufficient for many situations, however it
requires the searching algorithm to store its state in some persistent
memory which can be troublesome depending on the volume of these
variables.

My approach is to save the state of the text fetching progress
and have the algorithm handle the fetching of new data (if needed),
i.e:

function serch-bm(pattern, get-text-callback)
  while get-text-callback() returns more data do
    do the matching here...
  done
end

function find-text(pattern)
  call search-bm(pattern, get-text-callback);
end

This allows the searching algorithm to see the text as a continuous
byte stream. The get_text() callback is like a char device, altough
it returns data in variable sized blocks, there is no random access.

Why? Because it dramatically simplifies the searching algorithms
since they can decide when to fetch the next block which is
absolutely essential for cases like regular expression where the
state machine gets quite complex and a single entry point (i.e.
leaving the function and enter it again with new data) would make
things much more complex. Minor advantages are a) the get_text()
implementation is usually quite small and will hopefully not mess
up the cacheline optimizations of the actual searching loop and
b) it allows the searching algorithm to do prefetching.

<< coffee break >>

Now on to the subject of matching over multiple skbs, in a perfect
world we can try to do async searching by storing the partial match
state in the conntrack entry but then again what do we do when we
find a partial match in a skb? Drop it? We can't access the next
skb or fragment just now to see if the full pattern is present. So
what this would lead to is that we let the handshake and some
payload pass until we find the skb that successfully ends a partial
match. This is basically what IOS and JunOS do (correct me if I'm
wrong). A common use case for this is to filter out malicious HTTP
requests and other easy to identify data that might harm a service
that is vulnerable but has no fix available just yet. In order to
avoid DoS certain limits must be established such as maximum number
of skbs or maximum payload scanned before aborting but most of these
limits can be worked around for example by pretending huge HTTP
headers in front of the malicious header line and it will be very
hard to find a breakeven to actually be able to filter out some of
it while not allowing a DoS due to massive filtering.

Can we avoid letting through the initial handshake? Yes but it's
probably not worth the effort. Will we be able to ultimatively
solve the filtering facitily vs DoS vulnerability issue? No. Can
we avoid letting through previous skbs which could do harm as well
but only a partial match has been found? Not at this level or
without nasty hacks such as acking the data but not passing it up.
Worth it? Probably not. To my understanding it is not worth to
try too hard and be perfect, I see the textsearch filtering as
nice to have for filtering on the first few dozen bytes of the
initial payload skb typically carrying enough headers to classify
on layer7 basis. If someone thinks the small probability of
having a match just between 2 fragments is important then please
go ahead and put the matching state into ts_state as well.

> Sure, your approach is definitely more generic because I just thought 
> about using such string matching framework for net apps, it's different 
> from that point of view, but I still see things that reminds me to my 
> proposed framework, say that ts_state thing ;).

Note that your state variable stores the matching state and ts_state
stores the state of the get_text() progress.

> No problem mate. Just curious about something, since this stuff will 
> live under lib, what kind of applications in kernel space could use this 
> string matching framework ? I bet that non-net kernel guys will surely 
> ask about it ;->

It definitely belongs into lib/ to my understanding, nobody gets
hurt by adding it there and the chances that someone other
subsystem having need for it are not that low.

One example (silly one though) that I did out of curiosity is a
sys_findtext taking an inode returning the first match. It simply
iterates over all blocks belonging to the file making use of the
buffer cache if possible. It's really fast. ;->

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-06 12:36       ` Thomas Graf
@ 2005-05-06 13:04         ` jamal
  2005-05-06 14:43           ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: jamal @ 2005-05-06 13:04 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Pablo Neira, netdev

On Fri, 2005-06-05 at 14:36 +0200, Thomas Graf wrote:
[..]

> function find-text(pattern)
>   foreach text-segment do
>     call search-bm(text-segment, pattern);
>   done
> end
> 
> This is what most people naturally think of when dealing with this
> problem and it might be sufficient for many situations, however it
> requires the searching algorithm to store its state in some persistent
> memory which can be troublesome depending on the volume of these
> variables.
> 

[..]

I dont see any difference between the two in terms of state storage.
Can you really avoid storing state? Can you give an example?

> Why? Because it dramatically simplifies the searching algorithms
> since they can decide when to fetch the next block which is
> absolutely essential for cases like regular expression where the
> state machine gets quite complex and a single entry point

The other scheme is to have the callback return a code which says
"get me more text" or " enough/done" in which case the state machine for
the search alg is no different in both cases.

>  (i.e.
> leaving the function and enter it again with new data) would make
> things much more complex. Minor advantages are a) the get_text()
> implementation is usually quite small and will hopefully not mess
> up the cacheline optimizations of the actual searching loop 

It depends on the search algorithm - if it is small, you benefit from
either scheme. If it is large, you loose in both.
 
> and
> b) it allows the searching algorithm to do prefetching.
> 

I am trying to sink this in; prefetching would be valuable for regexp,
but why would the other scheme not be able to do it? 

> << coffee break >>
> 

I have to run to work while you get coffee ;->
I will read the rest later.

cheers,
jamal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-06 13:04         ` jamal
@ 2005-05-06 14:43           ` Thomas Graf
  2005-05-07 13:03             ` Jamal Hadi Salim
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2005-05-06 14:43 UTC (permalink / raw)
  To: jamal; +Cc: Pablo Neira, netdev

* jamal <1115384649.7660.140.camel@localhost.localdomain> 2005-05-06 09:04
> I dont see any difference between the two in terms of state storage.
> Can you really avoid storing state? Can you give an example?

The state is kept on the stack since there is only one call to
the algorithm's search function per match iteration. Look at
the code below, it is called once and fetches the data
continuously until EOF or a match has been found.

get_text() sets *text to the start of the next buffer at the
virtual absolute position `consumed' and returns the length
of the new buffer.

	int i, q = 0, consumed = state->offset;

	for (;;) {
		text_len = conf->get_text(consumed, &text, conf, state);

		if (text_len == 0)
			break; /* no match found */

		for (i = 0; i < text_len; i++) {
			while (q > 0 && pattern[q] != text[i])
				q = prefix_tbl[q - 1];
			if (pattern[q] == text[i])
				q++;
			if (q == pattern_len) {
				state->offset = consumed + i - pattern_len + 1;
				return state->offset;
			}
		}

		consumed += text_len;
	}

A very simple implementation of get_text() could look like this:

static inline int get_linear_text(int offset, unsigned char **text,
				  struct ts_config *conf,
				  struct ts_state *state)
{
	unsigned char *string = (unsigned char *) state->args[0];
	size_t len = state->args[1];

	if (offset < len) {
		*text = string + offset;
		return len - offset;
	} else
		return 0;
}

As you can see, it expects a char * in args[0] and the length of it
in args[1]. All it does is check whether all bytes have been read
already and if not return the remaining part of the buffer so even
if the search algorithm can't consume all the bytes returned it will
still work as expected.

> The other scheme is to have the callback return a code which says
> "get me more text" or " enough/done" in which case the state machine for
> the search alg is no different in both cases.

In the case of interruptable search algorithms, we have to make sure to
save enough state information to continue at the same position. Surely
this is no problem with KMP or BM but gets real complex in the case
of automata based searching algorithms such as regular expressions. It
would mean to have a set of conditional jumps into the algorithms, I
tried it and it really gets complex.

Example: Assume we are are processing a regular expression with automata
approach and are currently look at a 'ANY' (+) recurrence. First thing
we do is check if there are more torkens to parse, if not we can return
true immediately. Otherwise we have to do a look ahead on the pattern
and cycle through the text looking if we can find the token following
the 'ANY' recurrence, e.g.

pattern := "c*\d+A"
current-token := \d+
next-token := 'A'
text := "ccc123A"
current-character := text[3] ('1')

  while NOT next-token matches current-character do
    if current-token matches current-character then
      goto no-match
    endif
    current-character := next(text)
    proceed-to-next-character()
    /* At this position we'd have to return, fetch new text and come back
       right here with the exact same state, difficult huh? */
    if NOT more text to process then
      goto no-match 
    endif
  endwhile

Not sure if this is clear out of context but maybe it gives you an idea
why it is easier to maintain state of get_text() rather than the state
of a whole searching algorithm.

> It depends on the search algorithm - if it is small, you benefit from
> either scheme. If it is large, you loose in both.

That's true, smallish algorithms such as KMP and BM don't gain much
but for complex algorithms such as regexp my approach really is a
lot faster.

> > and
> > b) it allows the searching algorithm to do prefetching.
> > 
> 
> I am trying to sink this in; prefetching would be valuable for regexp,
> but why would the other scheme not be able to do it? 

I'm really not an expert on the validity of L1 caches and how to optimize
it best but I believe that the less memory movement is in between the
more likely prefetching helps? Both schemes involve a switch to another
stack namespace but get_text() tends to be a lot smaller and less intrusive
than a store & reload of a complex state machine. I really can't tell
which is better regarding this subject without trying it out actually.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-06 14:43           ` Thomas Graf
@ 2005-05-07 13:03             ` Jamal Hadi Salim
  2005-05-08 11:45               ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Jamal Hadi Salim @ 2005-05-07 13:03 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev, Pablo Neira

On Fri, 2005-06-05 at 16:43 +0200, Thomas Graf wrote:

> As you can see, it expects a char * in args[0] and the length of it
> in args[1]. All it does is check whether all bytes have been read
> already and if not return the remaining part of the buffer so even
> if the search algorithm can't consume all the bytes returned it will
> still work as expected.
> 

Ok, makes sense - in the case of a string spanning multi skbs, i suppose
it wouldnt matter, correct?

[..] 

> Not sure if this is clear out of context but maybe it gives you an idea
> why it is easier to maintain state of get_text() rather than the state
> of a whole searching algorithm.
> 

I got it. I suppose in the case of text contained within one skb this
would be an improvement (spanning across multi-skb should be no
difference; an improvemengt nonetheless)

> > 
> > I am trying to sink this in; prefetching would be valuable for regexp,
> > but why would the other scheme not be able to do it? 
> 
> I'm really not an expert on the validity of L1 caches and how to optimize
> it best but I believe that the less memory movement is in between the
> more likely prefetching helps? Both schemes involve a switch to another
> stack namespace but get_text() tends to be a lot smaller and less intrusive
> than a store & reload of a complex state machine. I really can't tell
> which is better regarding this subject without trying it out actually.
> 

Sorry - I thought you were talking about pre-fetching text as in
lookahead for text in a regexp state machine.
I am not sure i see the L1 cache connection. Both seem to have tight for
loops and depending on the algorithm there would be no difference
in cache warmth afaics. Infact your scheme may suffer more because it
has a lot of stuff on the stack. However, playing around with the code
is the only way to find out.

cheers,
jamal

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-07 13:03             ` Jamal Hadi Salim
@ 2005-05-08 11:45               ` Thomas Graf
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2005-05-08 11:45 UTC (permalink / raw)
  To: Jamal Hadi Salim; +Cc: netdev, Pablo Neira

* Jamal Hadi Salim <1115470985.19561.58.camel@localhost.localdomain> 2005-05-07 09:03
> On Fri, 2005-06-05 at 16:43 +0200, Thomas Graf wrote:
> 
> > As you can see, it expects a char * in args[0] and the length of it
> > in args[1]. All it does is check whether all bytes have been read
> > already and if not return the remaining part of the buffer so even
> > if the search algorithm can't consume all the bytes returned it will
> > still work as expected.
> > 
> 
> Ok, makes sense - in the case of a string spanning multi skbs, i suppose
> it wouldnt matter, correct?

Strings spanning multiple skbs can be handled just like a paged skbs
_iff_ all skbs are known at the point the search starts, otherwise
we'll need more trickering.

> Sorry - I thought you were talking about pre-fetching text as in
> lookahead for text in a regexp state machine.
> I am not sure i see the L1 cache connection. Both seem to have tight for
> loops and depending on the algorithm there would be no difference
> in cache warmth afaics. Infact your scheme may suffer more because it
> has a lot of stuff on the stack. However, playing around with the code
> is the only way to find out.

You're probably right, I suspect mine to perform better because in a
typical search there is less work to do per loop and the interruption
to fetch the next block of text is less intrusive. Time will tell.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-04 23:40 [RFC] textsearch infrastructure + skb_find_text() Thomas Graf
  2005-05-05 12:42 ` jamal
  2005-05-05 17:02 ` Pablo Neira
@ 2005-05-06 21:44 ` Thomas Graf
  2005-05-07  0:17   ` YOSHIFUJI Hideaki / 吉藤英明
  2 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2005-05-06 21:44 UTC (permalink / raw)
  To: netdev; +Cc: Pablo Neira

Quite silly regular expression implementation for textsearch
infrastructure. Still has issues with textsearch_next() and
can probably be optimized a lot but should be a good base
to work on.

diff -Nru -X dontdiff linux-2.6.12-rc3.orig/include/linux/textsearch_regexp.h linux-2.6.12-rc3/include/linux/textsearch_regexp.h
--- linux-2.6.12-rc3.orig/include/linux/textsearch_regexp.h	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3/include/linux/textsearch_regexp.h	2005-05-06 19:33:20.000000000 +0200
@@ -0,0 +1,47 @@
+#ifndef __LINUX_TEXTSEARCH_REGEXP_H
+#define __LINUX_TEXTSEARCH_REGEXP_H
+
+#include <linux/types.h>
+
+enum {
+	TS_RE_SPECIFIC,		/* specific character */
+	TS_RE_WILDCARD,		/* any character */
+	TS_RE_DIGIT,		/* isdigit() */
+	TS_RE_XDIGIT,		/* isxdigit() */
+	TS_RE_PRINT,		/* isprint() */
+	TS_RE_ALPHA,		/* isalpha() */
+	TS_RE_ALNUM,		/* isalnum() */
+	TS_RE_ASCII,		/* isascii() */
+	TS_RE_CNTRL,		/* iscntrl() */
+	TS_RE_GRAPH,		/* isgraph() */
+	TS_RE_LOWER,		/* islower() */
+	TS_RE_UPPER,		/* isupper() */
+	TS_RE_PUNCT,		/* ispunct() */
+	TS_RE_SPACE,		/* isspace() */
+	__TS_RE_TYPE_MAX,
+};
+#define TS_RE_TYPE_MAX (__TS_RE_TYPE_MAX - 1)
+
+enum {
+	TS_RE_SINGLE,		/* 1 occurrence */
+	TS_RE_PERHAPS,		/* 1 or 0 occurrence */
+	TS_RE_ANY,		/* 0..n occurrences */
+	TS_RE_MULTI,		/* 1..n occurrences */
+	__TS_RE_RECUR_MAX,
+};
+#define TS_RE_RECUR_MAX (__TS_RE_RECUR_MAX - 1)
+
+/**
+ * struct ts_regexp_token - regular expression token
+ * @type: type of token
+ * @recur: number of recurrences
+ * @value: character value for TS_RE_SPECIFIC
+ */
+struct ts_regexp_token
+{
+	__u16		type;
+	__u8		recur;
+	__u8		value;
+};
+
+#endif
diff -Nru -X dontdiff linux-2.6.12-rc3.orig/lib/ts_regexp.c linux-2.6.12-rc3/lib/ts_regexp.c
--- linux-2.6.12-rc3.orig/lib/ts_regexp.c	1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.12-rc3/lib/ts_regexp.c	2005-05-06 23:13:52.000000000 +0200
@@ -0,0 +1,253 @@
+/*
+ * lib/ts_regexp.c	Naive and very limited regular expression
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <linux/config.h>
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/textsearch.h>
+#include <linux/textsearch_regexp.h>
+
+#define TS_RE_TOKENS(conf) \
+	((struct ts_regexp_token *) ((unsigned char *) (conf) \
+				     + sizeof(struct ts_config)))
+
+/* other values derived from ctype.h */
+#define _A		0x100 /* ascii */
+#define _W		0x200 /* wildcard */
+
+/* Map to _ctype flags and some magic numbers */
+static u16 token_map[TS_RE_TYPE_MAX+1] = {
+	[TS_RE_SPECIFIC]  = 0,
+	[TS_RE_WILDCARD]  = _W,
+	[TS_RE_CNTRL]	  = _C,
+	[TS_RE_LOWER]	  = _L,
+	[TS_RE_UPPER]	  = _U,
+	[TS_RE_PUNCT]	  = _P,
+	[TS_RE_SPACE]	  = _S,
+	[TS_RE_DIGIT]	  = _D,
+	[TS_RE_XDIGIT]	  = _D | _X,
+	[TS_RE_ALPHA]	  = _U | _L,
+	[TS_RE_ALNUM]	  = _U | _L | _D,
+	[TS_RE_PRINT]	  = _P | _U | _L | _D | _SP,
+	[TS_RE_GRAPH]	  = _P | _U | _L | _D,
+	[TS_RE_ASCII]	  = _A,
+};
+
+static u16 token_lookup_tbl[256] = {
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*   0-  3 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*   4-  7 */
+_W|_A|_C,      _W|_A|_C|_S,  _W|_A|_C|_S,  _W|_A|_C|_S,		/*   8- 11 */
+_W|_A|_C|_S,   _W|_A|_C|_S,  _W|_A|_C,     _W|_A|_C,		/*  12- 15 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  16- 19 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  20- 23 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  24- 27 */
+_W|_A|_C,      _W|_A|_C,     _W|_A|_C,     _W|_A|_C,		/*  28- 31 */
+_W|_A|_S|_SP,  _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  32- 35 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  36- 39 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  40- 43 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  44- 47 */
+_W|_A|_D,      _W|_A|_D,     _W|_A|_D,     _W|_A|_D,		/*  48- 51 */
+_W|_A|_D,      _W|_A|_D,     _W|_A|_D,     _W|_A|_D,		/*  52- 55 */
+_W|_A|_D,      _W|_A|_D,     _W|_A|_P,     _W|_A|_P,		/*  56- 59 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  60- 63 */
+_W|_A|_P,      _W|_A|_U|_X,  _W|_A|_U|_X,  _W|_A|_U|_X,		/*  64- 67 */
+_W|_A|_U|_X,   _W|_A|_U|_X,  _W|_A|_U|_X,  _W|_A|_U,		/*  68- 71 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  72- 75 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  76- 79 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  80- 83 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_U,		/*  84- 87 */
+_W|_A|_U,      _W|_A|_U,     _W|_A|_U,     _W|_A|_P,		/*  88- 91 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_P,		/*  92- 95 */
+_W|_A|_P,      _W|_A|_L|_X,  _W|_A|_L|_X,  _W|_A|_L|_X,		/*  96- 99 */
+_W|_A|_L|_X,   _W|_A|_L|_X,  _W|_A|_L|_X,  _W|_A|_L,		/* 100-103 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 104-107 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 108-111 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 112-115 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_L,		/* 116-119 */
+_W|_A|_L,      _W|_A|_L,     _W|_A|_L,     _W|_A|_P,		/* 120-123 */
+_W|_A|_P,      _W|_A|_P,     _W|_A|_P,     _W|_A|_C,		/* 124-127 */
+_W,            _W,           _W,           _W,			/* 128-131 */
+_W,            _W,           _W,           _W,			/* 132-135 */
+_W,            _W,           _W,           _W,			/* 136-139 */
+_W,            _W,           _W,           _W,			/* 140-143 */
+_W,            _W,           _W,           _W,			/* 144-147 */
+_W,            _W,           _W,           _W,			/* 148-151 */
+_W,            _W,           _W,           _W,			/* 152-155 */
+_W,            _W,           _W,           _W,			/* 156-159 */
+_W|_S|_SP,     _W|_P,        _W|_P,        _W|_P,		/* 160-163 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 164-167 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 168-171 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 172-175 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 176-179 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 180-183 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 184-187 */
+_W|_P,         _W|_P,        _W|_P,        _W|_P,		/* 188-191 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 192-195 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 196-199 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 200-203 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 204-207 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 208-211 */
+_W|_U,         _W|_U,        _W|_U,        _W|_P,		/* 212-215 */
+_W|_U,         _W|_U,        _W|_U,        _W|_U,		/* 216-219 */
+_W|_U,         _W|_U,        _W|_U,        _W|_L,		/* 220-223 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 224-227 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 228-231 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 232-235 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 236-239 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 240-243 */
+_W|_L,         _W|_L,        _W|_L,        _W|_P,		/* 244-247 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L,		/* 248-251 */
+_W|_L,         _W|_L,        _W|_L,        _W|_L};		/* 252-255 */
+
+static inline int match_token(struct ts_regexp_token *t, u8 d)
+{
+	if (t->type)
+		return (token_lookup_tbl[d] & t->type) != 0;
+	else
+		return t->value == d;
+}
+
+static int regexp_find(struct ts_config *conf, struct ts_state *state)
+{
+	int i, q, consumed = state->offset;
+	struct ts_regexp_token *t = NULL, *next;
+	struct ts_regexp_token *tokens = TS_RE_TOKENS(conf);
+	int ntokens = conf->pattern_len / sizeof(*t);
+	size_t text_len = 0;
+	unsigned char *text;
+
+#define GET_TEXT() \
+	({ i = 0; conf->get_text(consumed, &text, conf, state); })
+
+#define end_of_text()  (i >= text_len && !GET_TEXT())
+#define more_text() (i < text_len || GET_TEXT())
+
+#define NEXT_CHAR() do { i++; consumed++; } while (0)
+	
+	for (i = 0, q = 0; q < ntokens; q++) {
+		t = &tokens[q];
+
+		if (likely(q < (ntokens - 1)))
+			next = &tokens[q+1];
+		else
+			next = NULL;
+
+		switch (t->recur) {
+			case TS_RE_SINGLE:
+				if (unlikely(end_of_text()))
+					goto no_match;
+
+				if (!match_token(t, text[i]))
+					goto no_match;
+				break;
+
+			case TS_RE_PERHAPS:
+				if (likely(more_text()))
+					if (match_token(t, text[i]))
+						break;
+				continue;
+
+			case TS_RE_MULTI:
+				if (unlikely(end_of_text()))
+					goto no_match;
+
+				if (!match_token(t, text[i]))
+					goto no_match;
+
+				NEXT_CHAR();
+				/* fall through */
+
+			case TS_RE_ANY:
+				if (next == NULL)
+					goto found_match;
+
+				if (likely(more_text())) {
+					while (!match_token(next, text[i])) {
+						if (!match_token(t, text[i]))
+							goto no_match;
+						NEXT_CHAR();
+						if (unlikely(end_of_text()))
+							goto no_match;
+					}
+				}
+				continue;
+		}
+
+		NEXT_CHAR();
+	}
+
+	if (q >= (ntokens - 1))
+		goto found_match;
+
+no_match:
+	return -1;
+
+found_match:
+	return 0;
+}
+
+static struct ts_config *regexp_init(const unsigned char *pattern, size_t len,
+				     int gfp_mask)
+{
+	int i, err = -EINVAL;
+	struct ts_config *conf;
+	struct ts_regexp_token *tokens = (struct ts_regexp_token *) pattern;
+
+	if (len  % sizeof(struct ts_regexp_token))
+		goto errout;
+
+	for (i = 0; i < len / sizeof(struct ts_regexp_token); i++) {
+		struct ts_regexp_token *t = &tokens[i];
+
+		if (t->type > TS_RE_TYPE_MAX ||
+		    t->type > TS_RE_RECUR_MAX)
+			goto errout;
+
+		t->type = token_map[t->type];
+	}
+
+	conf = alloc_ts_config(len, gfp_mask);
+	if (IS_ERR(conf))
+		return conf;
+
+	conf->pattern_len = len;
+	memcpy(TS_RE_TOKENS(conf), pattern, len);
+
+	return conf;
+
+errout:
+	return ERR_PTR(err);
+}
+
+static struct ts_ops regexp_ops = {
+	.name	= "regexp",
+	.find	= regexp_find,
+	.init	= regexp_init,
+	.owner	= THIS_MODULE,
+	.list	= LIST_HEAD_INIT(regexp_ops.list)
+};
+
+static int __init init_regexp(void)
+{
+	return textsearch_register(&regexp_ops);
+}
+
+static void __exit exit_regexp(void)
+{
+	textsearch_unregister(&regexp_ops);
+}
+
+MODULE_LICENSE("GPL");
+
+module_init(init_regexp);
+module_exit(exit_regexp);

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-06 21:44 ` Thomas Graf
@ 2005-05-07  0:17   ` YOSHIFUJI Hideaki / 吉藤英明
  2005-05-07  0:36     ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2005-05-07  0:17 UTC (permalink / raw)
  To: tgraf; +Cc: netdev, pablo, yoshfuji

In article <20050506214437.GH28419@postel.suug.ch> (at Fri, 6 May 2005 23:44:37 +0200), Thomas Graf <tgraf@suug.ch> says:

> Quite silly regular expression implementation for textsearch
> infrastructure. Still has issues with textsearch_next() and
> can probably be optimized a lot but should be a good base
> to work on.

Would you add some reference(s) on algorithm etc.?
Thanks.

--yoshfuji

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC] textsearch infrastructure + skb_find_text()
  2005-05-07  0:17   ` YOSHIFUJI Hideaki / 吉藤英明
@ 2005-05-07  0:36     ` Thomas Graf
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2005-05-07  0:36 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / ?$B5HF#1QL@; +Cc: netdev, pablo

* YOSHIFUJI Hideaki / ?$B5HF#1QL@ <20050507.091744.106860599.yoshfuji@linux-ipv6.org> 2005-05-07 09:17
> In article <20050506214437.GH28419@postel.suug.ch> (at Fri, 6 May 2005 23:44:37 +0200), Thomas Graf <tgraf@suug.ch> says:
> 
> > Quite silly regular expression implementation for textsearch
> > infrastructure. Still has issues with textsearch_next() and
> > can probably be optimized a lot but should be a good base
> > to work on.
> 
> Would you add some reference(s) on algorithm etc.?

Comes out of my own mind which means it's probably pretty poor, it
simply takes a deterministic finite state machine and runs it on
the text.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2005-05-08 11:45 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-05-04 23:40 [RFC] textsearch infrastructure + skb_find_text() Thomas Graf
2005-05-05 12:42 ` jamal
2005-05-05 14:12   ` Thomas Graf
2005-05-05 17:02 ` Pablo Neira
2005-05-05 17:42   ` Thomas Graf
2005-05-06  1:33     ` Pablo Neira
2005-05-06 12:36       ` Thomas Graf
2005-05-06 13:04         ` jamal
2005-05-06 14:43           ` Thomas Graf
2005-05-07 13:03             ` Jamal Hadi Salim
2005-05-08 11:45               ` Thomas Graf
2005-05-06 21:44 ` Thomas Graf
2005-05-07  0:17   ` YOSHIFUJI Hideaki / 吉藤英明
2005-05-07  0:36     ` Thomas Graf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).