From: Garance A Drosehn
Date: Fri, 16 Dec 2005 02:44:06 +0900
Posted: Thu, 15 Dec 2005 12:44:00 -0500
Subject: Re: regular expressions question
To: ruby-talk@ruby-lang.org (ruby-talk ML)
Message-Id: <97880ae0512150944n6d0dff05mc658eb4b50b6e386@mail.gmail.com>
In-Reply-To: <40cuppF1a1hsvU1@individual.net>

On 12/15/05, Robert Klemme wrote:
> ako... wrote:
> > thank you. yes, it seems to be the only way. just that it is a shame
> > that we have to match the same expression again! the information was
> > available already, it was just discarded during the first match in
> > your sample.
>
> I still didn't get what exactly you want.  Does this help?
>
> >> 'a,b ,c'.split /\s*,\s*/
> => ["a", "b", "c"]

Now that I've read the responses in this thread a few times, I think
I understand what he wants to do.  And I don't think it can be done
via scan.

First: He wants a single regex which will verify the syntax of an
entire line.  So, first he wants a true/false value, saying "The line
is valid, or it is not valid".  Never mind any values in the line,
just "is the line *completely valid*?".

Then, if the line is valid, he wants to break out individual pieces
of what was scanned, and he wants to do that without re-doing any of
the scans he did in the first regex.  The trick is that some of those
pieces are a repeating group, such as /(\s\w)*/.

What is confusing us is that he describes this using a simple example,
and when we solve the simple example he then says "you don't get the
bigger picture!".  Ugh.

Let me give an example, and see if someone can solve it.  My example
might still be something other than what he's thinking of, but maybe
it will help.

Let's say I'm expecting command lines of the form:

    first word is either 'copy' or 'duplicate'
    followed by one or more words
    followed by the word 'before' or 'after'
    followed by one or more words

So I could do the first step with the regexp:

    /^(copy|duplicate) \s+ (\w+\s+)+ (before|after) \s+ (\w+\s*)+ $/x

(hopefully I've done that right!).  *IF* that matches, then I know the
entire line is valid.
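
A minimal sketch of that first step, using the regexp exactly as written above. The constant name LINE_RE and the helper valid_line? are made up here for illustration; they are not from the thread.

```ruby
# Validation pattern from the message; the /x flag makes the regexp
# ignore the literal spaces used to lay it out.
LINE_RE = /^(copy|duplicate) \s+ (\w+\s+)+ (before|after) \s+ (\w+\s*)+ $/x

# Hypothetical helper: true only when the whole line fits the grammar.
def valid_line?(line)
  !LINE_RE.match(line).nil?
end

puts valid_line?("copy apple pear plum peach after bill bob")  # true
puts valid_line?("copy after bill")                            # false, no source words
```

This only answers the yes/no question; it says nothing yet about the individual words that were matched.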

Then, after I know the line is valid, I want the array of source-words,
and the array of destination-words which were matched.  I want to do
that by picking out information in MatchData, not by doing a new scan.
The thing is, I don't think I have a way of knowing how many times
the first '(\w+\s+)+' was matched.  So I can't just do a slice of
$~.captures because I don't know what the starting and ending indexes
of that slice would be.

I could put another set of parentheses around the two repeating groups:

    /^(copy|duplicate) \s+ ((\w+\s+)+) (before|after) \s+ ((\w+\s*)+) $/x

But that doesn't really give me two separate arrays of the individual
values that made up each group.  It just matches each group as a whole.

Given two data lines of:

    copy apple pear plum peach after bill bob
    duplicate tomato before joe alice alfred tommy jane

I want a way to set two arrays:

    srcfood  = ["apple ", "pear ", "plum ", "peach "]
    destword = ["bill ", "bob"]

from the first line, and

    srcfood  = ["tomato "]
    destword = ["joe ", "alice ", "alfred ", "tommy ", "jane"]

from the second line.  I'll agree this is a weird example, but I think
it shows the issue.

If I apply the above pattern to the first line, I'll see a MatchData
result where:

    $~.captures ==
      ["copy", "apple pear plum peach ", "peach ", "after", "bill bob", "bob"]

Notice: There isn't *any* element which contains a value of just
"apple ", or just "pear ", or just "plum ", even though the regex
obviously had to match each one of those.

-- 
Garance Alistair Drosehn     =      drosihn@gmail.com
Senior Systems Programmer               or   gad@FreeBSD.org
Rensselaer Polytechnic Institute;             Troy, NY;  USA
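
For reference, the behaviour described in the message can be reproduced with the short sketch below. It uses the doubly-parenthesized pattern from the post; the names NESTED_RE, srcfood and destword are illustrative. The last two lines show the re-scan fallback mentioned earlier in the thread, which is exactly the extra matching pass the original poster hoped to avoid, so it is not a solution to the question as posed.

```ruby
# Pattern from the message, with an outer group around each repetition.
NESTED_RE = /^(copy|duplicate) \s+ ((\w+\s+)+) (before|after) \s+ ((\w+\s*)+) $/x

m = NESTED_RE.match("copy apple pear plum peach after bill bob")

# A repeated group keeps only its final iteration, so the individual
# source words never appear as separate captures.
p m.captures
# => ["copy", "apple pear plum peach ", "peach ", "after", "bill bob", "bob"]

# Fallback: scan the outer groups a second time to split them up.
srcfood  = m[2].scan(/\w+\s+/)
destword = m[5].scan(/\w+\s*/)
p srcfood   # => ["apple ", "pear ", "plum ", "peach "]
p destword  # => ["bill ", "bob"]
```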