Issue #15166 has been updated by jzakiya (Jabari Zakiya).


Hi 

I just submitted this issue feature request: 

https://bugs.ruby-lang.org/issues/15172

to deal with the issue of using (or not) the ``__builtin_ctz`` compiler directive.

I implemented code that mimicked it that also greatly increases the ruby ``gcd`` performance.
I included the new code and benchmarks to the gist I previously linked.

https://gist.github.com/jzakiya/44eae4feeda8f6b048e19ff41a0c6566

```
[jzakiya@jabari-pc ~]$ ./gcd2
gcd between numbers in [1 and 2000]

gcdwikipedia7fast32   :  time = 73
gcdwikipedia4fast     :  time = 113
gcdFranke             :  time = 133
gcdwikipedia3fast     :  time = 139
gcdwikipedia2fastswap :  time = 162
gcdwikipedia5fast     :  time = 140
gcdwikipedia7fast     :  time = 129
gcdwikipedia2fast     :  time = 161
gcdwikipedia6fastxchg :  time = 145
gcdwikipedia2fastxchg :  time = 168
gcd_iterative_mod     :  time = 230
gcd_recursive         :  time = 232
basicgcd              :  time = 234
rubygcd               :  time = 305
gcdwikipedia2         :  time = 312
gcdwikipedia7fast32_a :  time = 129
gcdwikipedia4fast_a   :  time = 149
rubygcd_a             :  time = 193
rubygcd_b             :  time = 169
gcd between numbers in [1000000001 and 1000002000]

gcdwikipedia7fast32   :  time = 76
gcdwikipedia4fast     :  time = 106
gcdFranke             :  time = 121
gcdwikipedia3fast     :  time = 127
gcdwikipedia2fastswap :  time = 153
gcdwikipedia5fast     :  time = 126
gcdwikipedia7fast     :  time = 118
gcdwikipedia2fast     :  time = 148
gcdwikipedia6fastxchg :  time = 134
gcdwikipedia2fastxchg :  time = 154
gcd_iterative_mod     :  time = 215
gcd_recursive         :  time = 214
basicgcd              :  time = 220
rubygcd               :  time = 287
gcdwikipedia2         :  time = 289
gcdwikipedia7fast32_a :  time = 116
gcdwikipedia4fast_a   :  time = 142
rubygcd_a             :  time = 180
rubygcd_b             :  time = 155
```

Note using the ``__builtin_ctz`` mimicking code, instead of the directive itself,
still makes the ``gcdwikipedia7fast32_a`` the third fastest version, and obviously
the preferred implementation if not using ``__builtin_ctz``.

I present this in asking you how you want me to proceed, because I don't really know
C code and how to do PRs to Ruby. If you can lay out a detailed process for me to do
that maybe I can assess what is in my capacity to do.

At minimum, the code for ``rubygcd_a`` could|can be incorporated into the codebase 
without dealing right now with the ``__builtin_ctz`` directive issue.


----------------------------------------
Feature #15166: 2.5 times faster implementation than current gcd implmentation
https://bugs.ruby-lang.org/issues/15166#change-74224

* Author: jzakiya (Jabari Zakiya)
* Status: Assigned
* Priority: Normal
* Assignee: watson1978 (Shizuo Fujita)
* Target version: 
----------------------------------------
This is to be more explicit (and accurate) than https://bugs.ruby-lang.org/issues/15161

This is my modified gcd benchmarks code, originally presented by Daniel Lemire (see 15161).

https://gist.github.com/jzakiya/44eae4feeda8f6b048e19ff41a0c6566

Ruby's current implementation of Stein's gcd algorithm is only slightly faster than the
code posted on the wikepedia page, and over 2.5 times slower than the fastest implementation
in the benchmarks.

```
[jzakiya@localhost ~]$ ./gcdbenchmarks
gcd between numbers in [1 and 2000]

gcdwikipedia7fast32   :  time = 99
gcdwikipedia4fast     :  time = 121
gcdFranke             :  time = 126
gcdwikipedia3fast     :  time = 134
gcdwikipedia2fastswap :  time = 136
gcdwikipedia5fast     :  time = 139
gcdwikipedia7fast     :  time = 138
gcdwikipedia2fast     :  time = 136
gcdwikipedia6fastxchg :  time = 144
gcdwikipedia2fastxchg :  time = 156
gcd_iterative_mod     :  time = 210
gcd_recursive         :  time = 215
basicgcd              :  time = 211
rubygcd               :  time = 267
gcdwikipedia2         :  time = 321
gcd between numbers in [1000000001 and 1000002000]

gcdwikipedia7fast32   :  time = 100
gcdwikipedia4fast     :  time = 121
gcdFranke             :  time = 126
gcdwikipedia3fast     :  time = 134
gcdwikipedia2fastswap :  time = 136
gcdwikipedia5fast     :  time = 138
gcdwikipedia7fast     :  time = 138
gcdwikipedia2fast     :  time = 136
gcdwikipedia6fastxchg :  time = 144
gcdwikipedia2fastxchg :  time = 156
gcd_iterative_mod     :  time = 210
gcd_recursive         :  time = 215
basicgcd              :  time = 211
rubygcd               :  time = 269
gcdwikipedia2         :  time = 323
```

This is Ruby's code per: https://github.com/ruby/ruby/blob/3abbaab1a7a97d18f481164c7dc48749b86d7f39/rational.c#L285-L307
which is basically the wikepedia implementation.

```
inline static long
i_gcd(long x, long y)
{
    unsigned long u, v, t;
    int shift;

    if (x < 0)
	x = -x;
    if (y < 0)
	y = -y;

    if (x == 0)
	return y;
    if (y == 0)
	return x;

    u = (unsigned long)x;
    v = (unsigned long)y;
    for (shift = 0; ((u | v) & 1) == 0; ++shift) {
	u >>= 1;
	v >>= 1;
    }

    while ((u & 1) == 0)
	u >>= 1;

    do {
	while ((v & 1) == 0)
	    v >>= 1;

	if (u > v) {
	    t = v;
	    v = u;
	    u = t;
	}
	v = v - u;
    } while (v != 0);

    return (long)(u << shift);
}
```

This is the fastest implementation from the benchmarks. (I originally, wrongly, cited
the implementation in the article, which is 4|5th fastest in benchmarks, but
still almost 2x faster than the Ruby implementation.)

```
// based on wikipedia's article, 
// fixed by D. Lemire,  K. Willets
unsigned int gcdwikipedia7fast32(unsigned int u, unsigned int v)
{
     int shift, uz, vz;
     if ( u == 0) return v;
     if ( v == 0) return u;
     uz = __builtin_ctz(u);
     vz = __builtin_ctz(v);
     shift = uz > vz ? vz : uz;
     u >>= uz;
     do {
       v >>= vz;
       int diff = v;
       diff -= u;
       if ( diff == 0 ) break;
       vz = __builtin_ctz(diff);
       if ( v <  u ) u = v;
       v = abs(diff);
    
     } while( 1 );
     return u << shift;
}
```

The key to speeding up all the algorithms is using the ``__builtin_ctz(x)`` directive
to determine the number of trailing binary '0's.



-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>