From: "Sean Middleditch" <elanthis / awesomeplay.com>
>
> On Sun, 2001-10-14 at 20:02, Bill Kelly wrote:
[...]
> > take it that ought to tokenize to
> > 'abc', 'def', '\"abc', '123,456\,xxy' ??????
> 
> Ya, that was the tokenization I was looking for.
> 
> I don't think I've ever seen an app do that, but after some
> inexperienced user decides to go hand-tweak stuff, things can get
> ugly...
> 
> No, I've never seen that, but I am also a worst case scenario type
> person.  ^,^  Also, I look at the rules of what the syntax means, and I
> always make sure my code can completely follow the rules no matter how
> weird.  There is a 1 in a trillion chance it's needed, but oh well.  I'm
> weird like that.  ^,^

Well, indeed, I think I may be weird the same way.  =)  But so far
I'd been trying to infer the syntax from just the examples posted.
(Also, sometimes the user might not be able to tweak the data - we
have an HTML parser that's more of an XML parser at heart - doesn't
deal with "optional closing tags" like HTML does - but, we're the
ones in control of all HTML produced that needs to be parsed by
that parser - so in cases like that I make exceptions and try to
come up with the simplest (most self-consistent) syntax that can be
expressed, and have the parser or tokenizer code simplified as
a result [keeping in mind that if we need more "advanced"
functionality we can always add it as the need arises].)  :)

BTW I just ran a test through Excel, and what it output as the
CSV version of the data was, IMHO, abhorrent.  Eww.  It took:

+-----+---------------+
|blerb|"hey,you","man,|
+-----+---------------+

And wrote (to the .csv file):

blerb,"""hey,you"",""man,"


. . . . huh????  My guess is that if there's a "double immediate
nested quote", it becomes a single quote in the output.

. . . Okay, in puzzling over it a bit more I guess it's not >that<
bad. . .   It would seem a leading quote is a special case meaning
the whole field is quoted - so far pretty standard (in a "quoted-
vs.-non-quoted field" sense) . . . but then double-contiguous-quotes 
appearing _within_ that string are escaped to single double-quotes
in the output.
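
A minimal sketch of that quoting rule, inferred only from the one
Excel sample above (so treat it as a guess at the convention, not a
spec).  The helper names excel_quote and excel_unquote are made up
for illustration:

```ruby
# Hypothetical helpers; the rule is inferred from a single Excel sample.

# Writing: wrap the field in quotes and double every embedded quote,
# but only when the field contains a quote, a comma, or a newline.
def excel_quote(field)
  return field unless field =~ /[",\n]/
  '"' + field.gsub('"', '""') + '"'
end

# Reading: strip the outer quotes, then collapse each "" back to ".
def excel_unquote(field)
  quoted = field.length >= 2 && field[0, 1] == '"' && field[-1, 1] == '"'
  return field unless quoted
  field[1..-2].gsub('""', '"')
end

cell = '"hey,you","man,'
puts excel_quote(cell)                         # prints """hey,you"",""man,"
puts excel_unquote(excel_quote(cell)) == cell  # prints true
```

The first line printed is the second field of the .csv line Excel
wrote, which is at least consistent with the guess.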

Hmm . . . .   Guess that wouldn't be *too* bad in a regexp.  =-)
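
Here is roughly what such a regexp might look like.  This is only a
sketch under the assumption that "" inside a quoted field is the sole
escape; the names EXCEL_FIELD and excel_csv_split are invented here,
and a malformed field (say, text after a closing quote) is silently
skipped rather than reported:

```ruby
# Hypothetical sketch: one field is a comma, then either a quoted body
# (where "" stands for one literal quote) or a bare run containing no
# quotes or commas.  The lookahead leaves the next comma unconsumed.
EXCEL_FIELD = /,(?:"((?:""|[^"])*)"|([^",]*))(?=,|\z)/

def excel_csv_split(line)
  # Prepend a comma so the first field looks like every other field.
  (',' + line).scan(EXCEL_FIELD).map do |quoted, bare|
    quoted ? quoted.gsub('""', '"') : bare
  end
end

p excel_csv_split('blerb,"""hey,you"",""man,"')
```

On the Excel line above this should give back the two original
cells, blerb and "hey,you","man, (quotes and commas intact).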

[...]
> > def csv_split(str)
> >     # non-bang flatten/compact: the bang forms return nil when nothing changes
> >     str.scan(/(?:\A|,)\s*"((?:\\"|[^"])*)"|(?:\A|,)([^",]*|[^",][^,]*)(?=,|\z)/).flatten.compact
> > end
> > 
> 
> Jeez, it would take me hours to come up with that.  If employers took
> regexp's on resumes, you could get a hell of a job with that.  ~,^

Hehe thanks.  It's probably more structured than it looks.  For
instance you could get rid of the \A's by prepending a leading
comma to the string being processed, prior to feeding it to the
regexp.  It's really just matching:

  ,whatever,

or
  , "whatever" ,

. . . with some fancy stuff to allow the leading comma to also
be the start of the string, and the trailing comma [or end-of-string]
to not be included in the actual match (so the subsequent
match picks up there.)
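
To make the leading-comma idea concrete, here is a sketch of the same
regexp with the \A's dropped and a comma prepended instead.  The name
csv_split_comma is made up for this example, and it keeps the
backslash-escaped-quote syntax of the csv_split quoted above:

```ruby
# Hypothetical variant of csv_split: every field, including the first,
# now starts with a comma, so the \A alternatives are no longer needed.
def csv_split_comma(str)
  (',' + str).scan(
    /,\s*"((?:\\"|[^"])*)"|,([^",]*|[^",][^,]*)(?=,|\z)/
  ).flatten.compact  # each match fills one group; drop the nil from the other
end

p csv_split_comma('abc,def')        # => ["abc", "def"]
p csv_split_comma('abc, "d,e" ,f')  # => ["abc", "d,e", "f"]
```

The non-bang flatten and compact are used so the method always
returns an array, even when nothing matches.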


Regards,

Bill