--DIOMP1UsTsWJauNi
Content-Type: multipart/mixed; boundary="LpQ9ahxlCli8rRTG"
Content-Disposition: inline


--LpQ9ahxlCli8rRTG
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

* noreply / rubyforge.org (noreply / rubyforge.org) wrote:
> Summary: String#scan loops forever if scanned string is modified inside block.

The subject doesn't really reflect what's actually happening.

> Initial Comment:
> ruby 1.8.4 (2005-12-24)
>
> Following code loops infinitely:
>
> a = " 12345678 "; a.scan(/\d/) {|s| a[3,2]='test';  s}

I'm not convinced this is a bug per se.  At least not any more than
"loop { }" is.  What's actually happening is easier to demonstrate than
explain, so here goes (I'm using the caret as the position indicator).

  " 12345678 "
   ^ #=> no match
  " 12345678 "
    ^ #=> match, a = " 12test5678 "
  " 12test5678 "
     ^ #=> match, a = " 12testst5678 "
  " 12testst5678 "
      ^ #=> no match
    ... (snipped several irrelevant steps)
  " 12testst5678 "
           ^ #=> no match
  " 12testst5678 "
            ^ #=> match, a = " 12teststst5678 "  <-- eek!
  " 12teststst5678 "
             ^ #=> no match
  " 12teststst5678 "
              ^ #=> match, a = " 12testststst5678 "
  " 12testststst5678 "
               ^ #=> no match
  " 12testststst5678 "
                ^ #=> match, a = " 12teststststst5678 "
  (and so on, ad infinitum)
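
The walk-through above can be approximated with a small model.  This is
only a sketch: naive_scan is a hypothetical helper, not how String#scan is
implemented in string.c, and it assumes the scanner simply re-searches the
(possibly modified) receiver from the end of the previous match.  It also
assumes the pattern never matches the empty string.  A step limit stands
in for "ad infinitum", since String#scan itself may behave differently
across Ruby versions.

```ruby
# Hypothetical model of a scanner that keeps only a character offset
# into the receiver and searches the current contents on every step.
def naive_scan(str, pattern, max_steps)
  pos = 0
  matches = []
  while matches.size < max_steps && (m = pattern.match(str, pos))
    matches << m[0]
    pos = m.end(0)   # resume right after the previous match
    yield m[0]       # the block may modify str here
  end
  matches
end

a = " 12345678 ".dup
seen = naive_scan(a, /\d/, 6) { |s| a[3, 2] = 'test' }
seen  # => ["1", "2", "5", "5", "5", "5"]
```

Each replacement pushes the "5" two characters to the right, just ahead
of the offset, so the model keeps matching the same digit until the step
limit stops it.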

What honestly bothers me about this behavior is the converse: making the
receiver _smaller_ can cause the scanner to actually _miss_ matches,
like so:

  a, strs = '    abcdef', []
  a.scan(/[\w]/) { |s| a[0, 1] = ''; strs << s }
  strs #=> ['a', 'c', 'e']

Most people would expect ['a', 'b', 'c', 'd', 'e', 'f'] there.  This could be
"fixed" in a couple of ways:

* Raise an exception if the receiver is modified during a scan (I don't
  really like this option).
* Attempt to hack in offset adjustment into string modification.  The
  functions in question are rb_str_splice() and rb_str_aref(), although
  I haven't investigated fully, so there may be other methods as well.
  This is really my least-favorite option, because it doesn't handle the
  case where someone modifies the receiver while keeping the length the
  same.
* Leave things as they are and add a big warning to the String#scan
  documentation.  Personally, I prefer this option.
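
For illustration, the first option can be sketched at the user level.
scan_guarded is a hypothetical helper, not part of Ruby and not the
attached patch: it scans a private copy of the string and raises if the
original no longer equals that copy after the block returns.  Because it
compares contents rather than lengths, it also notices same-length
modifications.

```ruby
# Hypothetical helper: scan a snapshot of str and raise as soon as the
# block is seen to have changed the original string.
def scan_guarded(str, pattern)
  snapshot = str.dup
  snapshot.scan(pattern) do |match|
    yield match
    raise "receiver modified during scan" unless str == snapshot
  end
end

a, strs = '    abcdef'.dup, []
begin
  scan_guarded(a, /\w/) { |s| a[0, 1] = ''; strs << s }
rescue RuntimeError => e
  error = e.message
end
strs   # => ["a"]

# Scanning a copy directly sidesteps the problem without raising.
b, all = '    abcdef'.dup, []
b.dup.scan(/\w/) { |s| b[0, 1] = ''; all << s }
all    # => ["a", "b", "c", "d", "e", "f"]
```

The second call shows what callers can do under the third option: if the
block has to modify the string, scan a copy of it instead.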

Anyway, attached is a patch that adds a brief note to String#scan.  The
patch is against 1.8.4, but it applies cleanly to HEAD as well.

-- 
Paul Duncan <pabs / pablotron.org>        OpenPGP Key ID: 0x82C29562
http://www.pablotron.org/               http://www.paulduncan.org/

--LpQ9ahxlCli8rRTG
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="ruby-1.8.4-str_scan_warning.diff"
Content-Transfer-Encoding: quoted-printable

diff -ur ruby-1.8.4/string.c ruby-1.8.4-string_doc/string.c
--- ruby-1.8.4/string.c	2005-10-27 04:19:20.000000000 -0400
+++ ruby-1.8.4-string_doc/string.c	2006-01-26 11:52:03.000000000 -0500
@@ -4240,6 +4240,11 @@
  *     
  *     <<cruel>> <<world>>
  *     rceu lowlr
+ *     
+ *  <em>Note:</em> You probably don't want to modify the receiver string
+ *  inside the block.  Ruby will let you do it, but the result probably
+ *  won't be what you expect or what you want.
+ *     
  */
 
 static VALUE

--LpQ9ahxlCli8rRTG--

--DIOMP1UsTsWJauNi
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.0 (GNU/Linux)

iD8DBQFD2P+WzdlT34LClWIRAjIfAKCrEouFqMratTJ7GK8nJ+5hBQAqBgCglNxo
P1wVe0cg0TNrtbl2eriK0qI=
=x+k1
-----END PGP SIGNATURE-----

--DIOMP1UsTsWJauNi--
