#if Bill Kelly
> Hi, I'm somewhat of a regexp-enjoying fiend myself; but after
> Randal Schwartz posted this horrendous thing last March:
> http://www.ruby-talk.com/cgi-bin/scat.rb/ruby/ruby-talk/12815
> I've been somewhat wary whenever someone mentions RFC822. :-)

It's a badly written RFC. The grammar is much more complex than it
needs to be, and there are many errors in the document, so I
understand if you run away quick :)

> However if the tokenizer you describe above is truly just what
> you need, it sounds kinda like a one-liner. Maybe :) What do
> you mean by "a set of delimiter characters are provided." ?
> Are they known beforehand, or variable with each invocation?
> Can quotes be escaped \" within the quoted string?

The delimiter set is variable with each invocation, but it never
includes '"', '(' or '\' (which are handled specially by the
tokenizer anyway).

Yes, you can have \" within "".

I'm actually down to only three calls in the whole parser, having
replaced everything else with regexps while porting to Ruby.

  tokenize(@strRep, ",\n\r", true, false)
  tokenize(@strRep, " \n\r", false, true)
  tokenize(@strRep, "@", false, false)

Note: the last two params mean 'skip comments' (anything in ()) and
'quoted tokens' (anything in "" should be treated as a token, keeping
the surrounding quotes).

e.g.

  Delimiter: .
  Input:     The.(quick)."brown.fox".jumps.("over").the."lazy.\"dog\""
  Output:    <The> <"brown.fox"> <jumps> <the> <"lazy."dog"">

Some relevant stuff from the RFC (stuff after ';' is their comments):

  qtext              = <any CHAR excepting <">,    ; => may be folded
                       "\" & CR, and including
                       linear-white-space>
  quoted-pair        = "\" CHAR                    ; may quote any char
  quoted-string      = <"> *(qtext/quoted-pair) <">; Regular qtext or
                                                   ;   quoted chars.
  CHAR               = <any ASCII character>       ; ( 0-177, 0.-127.)
  linear-white-space = 1*([CRLF] LWSP-char)        ; semantics = SPACE
                                                   ; CRLF => folding
  LWSP-char          = SPACE / HTAB                ; semantics = SPACE

Nice that they couldn't be bothered to make a full BNF grammar, eh ? ;)

Rik
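
For illustration, the tokenizer behaviour described above could be
sketched in Ruby roughly like this. It is not the original code, only a
minimal sketch under a few assumptions that the post does not confirm:
the argument order mirrors the three calls shown (string, delimiters,
skip comments, quoted tokens), comments may nest, a backslash is dropped
and the character after it kept literally (which is what the example
output suggests), and '(' or '"' are treated as ordinary characters when
their flag is false.

```ruby
# Hypothetical sketch of the tokenizer described in the post.
# Returns an array of tokens; empty tokens between delimiters are dropped.
def tokenize(str, delimiters, skip_comments, quoted_tokens)
  tokens  = []
  current = +""
  chars   = str.chars
  i = 0

  while i < chars.length
    c = chars[i]

    if c == "\\" && i + 1 < chars.length
      # quoted-pair outside a quoted string: keep the escaped char literally
      current << chars[i + 1]
      i += 2
    elsif skip_comments && c == "("
      # Skip a comment, assuming comments may nest
      depth = 1
      i += 1
      while i < chars.length && depth > 0
        case chars[i]
        when "\\" then i += 1      # skip the escaped char as well
        when "("  then depth += 1
        when ")"  then depth -= 1
        end
        i += 1
      end
    elsif quoted_tokens && c == '"'
      # Quoted string: delimiters inside are not special, quotes are kept
      current << c
      i += 1
      while i < chars.length && chars[i] != '"'
        i += 1 if chars[i] == "\\" && i + 1 < chars.length
        current << chars[i]
        i += 1
      end
      current << '"' if i < chars.length   # closing quote, if present
      i += 1
    elsif delimiters.include?(c)
      tokens << current unless current.empty?
      current = +""
      i += 1
    else
      current << c
      i += 1
    end
  end

  tokens << current unless current.empty?
  tokens
end
```

With both flags set to true and "." as the delimiter set, the example
input from the post comes out as the five tokens listed there:

  tokenize('The.(quick)."brown.fox".jumps.("over").the."lazy.\"dog\""',
           ".", true, true)
  # The, "brown.fox", jumps, the, "lazy."dog""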