On Sun, 2001-10-14 at 21:42, Bill Kelly wrote:
> 
> From: "Sean Middleditch" <elanthis / awesomeplay.com>
> >
> > On Sun, 2001-10-14 at 20:02, Bill Kelly wrote:
> [...]
> > > take it that ought to tokenize to
> > > 'abc', 'def', '\"abc', '123,456\,xxy' ??????
> > 
> > Ya, that was the tokenization I was looking for.
> > 
> > I don't think I've ever seen an app do that, but after some
> > inexperienced user decides to go hand-tweak stuff, things can get
> > ugly...
> > 
> > No, I've never seen that, but I am also a worst case scenario type
> > person.  ^,^  Also, I look at the rules of what the syntax means, and I
> > always make sure my code can completely follow the rules no matter how
> > weird.  There is a 1 in a trillion chance it's needed, but oh well.  I'm
> > weird like that.  ^,^
> 
> Well, indeed, I think I may be weird the same way.  =)  But so far
> I'd been trying to infer the syntax from just the examples posted.
> (Also, sometimes the user might not be able to tweak the data- we
> have an HTML parser that's more of an XML parser at heart - doesn't
> deal with "optional closing tags" like HTML does - but, we're the
> ones in control of all HTML produced that needs to be parsed by
> that parser - so in cases like that I make exceptions and try to
> come up with the simplest (most self-consistent) syntax that can be
> expressed, and have the parser or tokenizer code simplified as
> a result [keeping in mind that if we need more "advanced" 
> functionality we can always add it as the need arises] :)
> 

In this case, I just think in terms of the shell, and how it handles
quotes and backslashes.  It really is simple, I suppose, if you're not
using a regexp to represent the whole thing.  ^,^
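
A minimal sketch of that shell-style idea, walking the line one
character at a time instead of using a regexp.  The method name
split_shell_style is made up for illustration, and the rules (a
backslash makes the next character literal, double quotes protect
commas and are dropped from the result) are only my assumption about
what the syntax should mean:

```ruby
# Hypothetical helper, not code from this thread.
# Splits a line on commas, treating quotes and backslashes roughly
# the way a shell would.
def split_shell_style(line)
  fields = []
  current = String.new
  in_quotes = false
  escaped = false

  line.each_char do |ch|
    if escaped
      current << ch            # character after a backslash is literal
      escaped = false
    elsif ch == "\\"
      escaped = true           # remember the backslash, drop it from output
    elsif ch == '"'
      in_quotes = !in_quotes   # quotes toggle quoting and are not kept
    elsif ch == "," && !in_quotes
      fields << current        # unquoted comma ends the field
      current = String.new
    else
      current << ch
    end
  end

  fields << current            # last field has no trailing comma
end

p split_shell_style('abc,def,"x,y",a\,b')
```

Whitespace around fields is kept as-is here; stripping it would be one
more rule to decide on.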

> BTW I just ran a test through Excel, and what it output as the
> CSV version of the data was, IMHO, abhorrent.  Eww.  It took:
> 
> +-----+---------------+
> |blerb|"hey,you","man,|
> +-----+---------------+
> 
> And wrote (to the .csv file):
> 
> blerb,"""hey,you"",""man,"
> 
> 
> . . . . huh????  My guess is that if there's a "double immediate
> nested quote", it becomes a single quote in the output.
> 

Wow.  That's pretty screwed up.  The anti-Microsoft person in me
immediately says "They do that to make cross platform and/or cross
product communication harder!" but it's probably just because the
Microsoft programmers never actually bother to use the export feature
themselves.  ^,^

> . . . Okay, in puzzling over it a bit more I guess it's not >that<
> bad. . .   It would seem a leading quote is a special case meaning
> the whole field is quoted - so far pretty standard (in a "quoted-
> vs.-non-quoted field" sense) . . . but then double-contiguous-quotes 
> appearing _within_ that string are escaped to single double-quotes
> in the output.
> 
> Hmm . . . .   Guess that wouldn't be *too* bad in a regexp.  =-)

Still ugly though.  ~,^
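
For comparison, a rough non-regexp sketch of the rule as described
above: a leading quote means the whole field is quoted, and a doubled
quote inside a quoted field stands for one literal quote.  The name
split_excel_style is hypothetical, and this only covers what can be
inferred from the one Excel line quoted above (no embedded newlines,
for instance):

```ruby
# Hypothetical helper, not code from this thread.
# Splits one line of Excel-style CSV into fields.
def split_excel_style(line)
  fields = []
  current = String.new
  in_quotes = false
  i = 0

  while i < line.length
    ch = line[i]
    if in_quotes
      if ch == '"' && line[i + 1] == '"'
        current << '"'         # doubled quote -> one literal quote
        i += 1                 # skip the second quote of the pair
      elsif ch == '"'
        in_quotes = false      # lone quote closes the quoted field
      else
        current << ch          # commas are ordinary characters in here
      end
    elsif ch == '"' && current.empty?
      in_quotes = true         # leading quote: the whole field is quoted
    elsif ch == ","
      fields << current
      current = String.new
    else
      current << ch
    end
    i += 1
  end

  fields << current
end

p split_excel_style('blerb,"""hey,you"",""man,"')
```

Fed the line Excel wrote, this gives back the two original cells,
blerb and "hey,you","man, with the quotes inside the second cell
restored.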

> 
> [...]
> > > def csv_split(str)
> > >     str.scan(/(?:\A|,)\s*"((?:\\"|[^"])*)"|(?:\A|,)([^",]*|[^",][^,]*)(?=,|\z)/).flatten.compact
> > > end
> > > 
> > 
> > Jeez, it would take me hours to come up with that.  If employers took
> > regexps on resumes, you could get a hell of a job with that.  ~,^
> 
> Hehe thanks.  It's probably more structured than it looks.  For
> instance you could get rid of the \A's by prepending a leading
> comma to the string being processed, prior to feeding it to the
> regexp.  It's really just matching:
> 

Dude, I look at that regexp and my eyes stop seeing color, I start
seeing dark spots, and little attack sheep start fighting in front of my
face.  ^,^  I can't even *begin* to parse that stuff.

>   ,whatever,
> 
> or
>   , "whatever" ,
> 
> . . . with some fancy stuff to allow the leading comma to also
> be the start of the string, and the trailing comma [or end-of-
> string] to not be included in the actual match (so the subsequent
> match picks up there.)
> 
> 
> Regards,
> 
> Bill
> 
> 
>
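
To make the ",whatever," versus ', "whatever" ,' reading concrete,
here is a cut-down sketch of the leading-comma trick described above.
It is not Bill's original regexp: the name csv_split_simple is made up,
the unquoted branch is simplified, and escaped quotes are left in the
result with their backslashes, so treat it as an illustration of the
two shapes being matched rather than a replacement:

```ruby
# Hypothetical variant, not the csv_split posted earlier in the thread.
# Prepending a comma means every field starts with "," so no \A is needed.
def csv_split_simple(str)
  pattern = /
    ,\s*"((?:\\"|[^"])*)"     # , "whatever"   (quoted; \" allowed inside)
    |
    ,([^",]*)(?=,|\z)         # ,whatever      (stops before next comma or end)
  /x

  ("," + str).scan(pattern).map { |quoted, bare| quoted || bare }
end

p csv_split_simple('abc,"x,y",ghi')
```

The lookahead (?=,|\z) is what leaves the trailing comma unconsumed, so
the next match can start on it.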