Issue #13618 has been updated by ioquatix (Samuel Williams).


> Using a background thread is your mistake.

Please don't assume I made this design; it was made by other people. I merely tested it because I was interested in the performance overhead, and yes, the overhead is significant. Let's be generous: people who invested their time and effort to build such a thing for Ruby deserve our appreciation, and knowing that the path they explored was not a good one is equally valuable.

> Multiple foreground threads safely use epoll_wait or kevent on the SAME epoll or kqueue fd. It's perfectly safe to do that.

Sure, that's reasonable. If you want to share those data structures across threads, you can dispatch your work across different threads too. I liked what you did with https://yhbt.net/yahns/yahns.txt and it's an interesting design.

The biggest single benefit of this design is that blocking operations in an individual "task" or "worker" won't block any other "task" or "worker", but only up to the limit of the thread pool you allocate; beyond that point, blocking WILL occur. So even this design cannot avoid blocking entirely.

The major downside of such a design is that workers must assume they could be running on different threads, so shared data structures need locking, which in turn causes contention. In addition, the current state of the Ruby GIL means that any such design will generally perform poorly.
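
To make the locking cost concrete, here is a minimal sketch (the stats hash and the counts are made up for illustration): under a multi-threaded worker model, even a trivial shared counter needs a Mutex, because MRI can switch threads between the read and the write of a read-modify-write sequence.

```
# Hypothetical shared state under a threaded worker model.
stats = Hash.new(0)
lock  = Mutex.new

workers = 8.times.map do
  Thread.new do
    1000.times do
      # Without the lock, `stats[:requests] += 1` is a read-modify-write
      # that can interleave across threads and lose updates, GIL or not.
      lock.synchronize { stats[:requests] += 1 }
    end
  end
end

workers.each(&:join)
p stats[:requests] # => 8000
```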

Here is an almost identical code path, run once with 8 threads and once with 8 forked processes, on Ruby 2.5:

```
> falcon serve --threaded
> wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    54.67ms   25.39ms 189.02ms   72.29%
    Req/Sec    18.50      7.18    40.00     53.38%
  1483 requests in 10.04s, 174.88MB read
Requests/sec:    147.74
Transfer/sec:     17.42MB

> falcon serve --forked
> wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    29.77ms   66.90ms 571.70ms   93.71%
    Req/Sec    71.50     19.54   128.00     83.42%
  5442 requests in 10.10s, 641.61MB read
Requests/sec:    538.90
Transfer/sec:     63.54MB
```

This test was run against a fresh Rails site (Rails performance isn't great to begin with), on macOS, which has pretty poor IO performance. Running the same thing on Linux gives:

```
% falcon serve --threaded
% wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    26.41ms   13.74ms 123.01ms   69.85%
    Req/Sec    38.53     11.26    80.00     63.38%
  3082 requests in 10.01s, 363.36MB read
Requests/sec:    307.99
Transfer/sec:     36.31MB

% falcon serve --forked
% wrk -t8 -c8 -d10 http://localhost:9292
Running 10s test @ http://localhost:9292
  8 threads and 8 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.78ms   24.91ms 309.70ms   97.59%
    Req/Sec   168.68     49.75   262.00     63.89%
  13203 requests in 10.02s, 1.52GB read
Requests/sec:   1318.05
Transfer/sec:    155.39MB
```

So I think it's safe to say that, in an end-to-end test, the GIL is a MAJOR performance issue. Feel free to correct me if you think I'm wrong. I'm sure the full story is more complicated than the above benchmarks, but I felt it was a useful comparison.

Therefore, right now, for highly concurrent IO with Ruby, what you actually want is the following:

- One process per CPU core.
- One IO thread per process.
- Multiple fibers, one per worker.

Blocking operations that cause performance problems should be handed off to a thread pool. For things like launching an external process or issuing a blocking syscall and waiting for it to finish, threads are ideal. A rough sketch of this layout follows.
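
Here is a minimal, self-contained sketch of that layout using only the standard library: a toy echo server with one process per core, one IO thread (the main thread) per process, and one fiber per connection. The port, NPROC, and the echo workload are assumptions for illustration; a real reactor such as async also handles writability, timers, and error conditions.

```
require "socket"

NPROC = 8 # assumption: one process per CPU core
server = TCPServer.new("localhost", 9292)

# One fiber per connection: reads cooperatively, yielding back to the
# event loop whenever the socket would block.
def make_worker(client)
  Fiber.new do
    loop do
      buf = client.read_nonblock(4096, exception: false)
      if buf == :wait_readable
        Fiber.yield # suspend until the event loop sees the socket readable
      elsif buf.nil?
        break # EOF
      else
        client.write(buf) # echo; a full reactor would also handle :wait_writable
      end
    end
    client.close
  end
end

NPROC.times do
  fork do
    # One IO thread per process: the main thread runs the event loop.
    fibers  = {}       # io => fiber
    watched = [server]

    loop do
      readable, = IO.select(watched)
      readable.each do |io|
        if io.equal?(server)
          # accept can race with sibling processes sharing the listen socket
          client = server.accept_nonblock(exception: false)
          next if client == :wait_readable
          fibers[client] = make_worker(client)
          watched << client
          io = client
        end
        fiber = fibers[io]
        fiber.resume
        unless fiber.alive?
          watched.delete(io)
          fibers.delete(io)
        end
      end
    end
  end
end

Process.waitall
```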

The major benefit of such a design is that individual fibers all run on the same thread. You ultimately have similar issues w.r.t. blocking as yahns. However, because all workers run cooperatively on the same thread, you don't have any locking, concurrency, or mutability issues. To me this is a massive benefit, as it makes writing code in this model super easy.
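
For contrast with the threaded counter sketch above, the cooperative version needs no Mutex at all: control only transfers at explicit yield points, so a read-modify-write can never be preempted halfway.

```
# Same hypothetical counter, but with cooperative fibers on one thread.
stats = Hash.new(0)

workers = 8.times.map do
  Fiber.new do
    1000.times do
      stats[:requests] += 1 # never interrupted: no other fiber can run here
      # a real worker would Fiber.yield while waiting on IO
    end
  end
end

workers.each { |worker| worker.resume while worker.alive? }
p stats[:requests] # => 8000
```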

> Typical reactor is not designed to handle that :P

Yes, but that's by design, not by accident. If you need to scale up, just fork more reactors. On the Linux desktop above, `async-http` can handle 100,000+ requests per second using 8 cores in trivial benchmarks. So performance is something that can scale. The next question, then, is design.

There is some elegance in the design you propose. It requires some kind of "Task" or "Worker": a fiber which yields when IO would block and resumes when IO is ready. Based on what you've said, would you mind explaining whether the "Task" or "Worker" is resumed on the same thread or a different one? Do you maintain a thread pool?

If it's always resumed on the same thread, how do you manage that? For example, perhaps you can show me how the following would work:

```
Thread.new do
  Worker.new do
    # ... blocking IO
  end

  Worker.new do
    # ... blocking IO
  end

  # implicitly waits for all workers to complete?
end
```

If you follow this model, the thread must be calling into `epoll` or `kqueue` in order to resume work. But based on what you've said, if several of the above threads are running and the thread invoking `epoll_wait` receives events belonging to a different thread, how does that work? Do you send the events to the other thread? If you do, what is the overhead? If you don't, do you move workers between threads?

Then, why not consider a model similar to async, which uses per-thread reactors? Workers do not move between threads, and the reactor never needs to send events to another thread.

Thanks for your continued time and patience discussing these interesting issues.

----------------------------------------
Feature #13618: [PATCH] auto fiber schedule for rb_wait_for_single_fd and rb_waitpid
https://bugs.ruby-lang.org/issues/13618#change-71723

* Author: normalperson (Eric Wong)
* Status: Assigned
* Priority: Normal
* Assignee: normalperson (Eric Wong)
* Target version: 
----------------------------------------
```
auto fiber schedule for rb_wait_for_single_fd and rb_waitpid

Implement automatic Fiber yield and resume when running
rb_wait_for_single_fd and rb_waitpid.

The Ruby API changes for Fiber are named after existing Thread
methods.

main Ruby API:

    Fiber#start -> enable auto-scheduling and run Fiber until it
		   automatically yields (due to EAGAIN/EWOULDBLOCK)

The following behave like their Thread counterparts:

    Fiber.start - Fiber.new + Fiber#start (prelude.rb)
    Fiber#join - run internal scheduler until Fiber is terminated
    Fiber#value - ditto
    Fiber#run - like Fiber#start (prelude.rb)

Right now, it takes over the rb_wait_for_single_fd() and
rb_waitpid() functions if the running Fiber is auto-enabled
(cont.c::rb_fiber_auto_sched_p).

Changes to existing functions are minimal.

New files (all new structs and relations should be documented):

    iom.h - internal API for the rest of RubyVM (incomplete?)
    iom_internal.h - internal header for iom_(select|epoll|kqueue).h
    iom_epoll.h - epoll-specific pieces
    iom_kqueue.h - kqueue-specific pieces
    iom_select.h - select-specific pieces
    iom_pingable_common.h - common code for iom_(epoll|kqueue).h
    iom_common.h - common footer for iom_(select|epoll|kqueue).h

Changes to existing data structures:

    rb_thread_t.afrunq   - list of fibers to auto-resume
    rb_vm_t.iom          - Ruby I/O Manager (rb_iom_t) :)

Besides rb_iom_t, all the new structs are stack-only and rely
extensively on ccan/list for branch-less, O(1) insert/delete.

As usual, understanding the data structures first should help
you understand the code.

Right now, I reuse some static functions in thread.c,
so thread.c includes iom_(select|epoll|kqueue).h

TODO:

    Hijack other blocking functions (IO.select, ...)

I am using "double" for timeout since it is more convenient for
arithmetic like parts of thread.c.   Most platforms have good FP,
I think.  Also, all "blocking" functions (rb_iom_wait*) will
have timeout support.

./configure gains a new --with-iom=(select|epoll|kqueue) switch

libkqueue:

  libkqueue support is incomplete; corner cases are not handled well:

    1) multiple fibers waiting on the same FD
    2) waiting for both read and write events on the same FD

  Bugfixes to libkqueue may be necessary to support all corner cases.
  Supporting these corner cases for native kqueue was challenging,
  even.  See comments on iom_kqueue.h and iom_epoll.h for
  nuances.

Limitations

Test script I used to download a file from my server:
----8<---
require 'net/http'
require 'uri'
require 'digest/sha1'
require 'fiber'

url = 'http://80x24.org/git-i-forgot-to-pack/objects/pack/pack-97b25a76c03b489d4cbbd85b12d0e1ad28717e55.idx'

uri = URI(url)
use_ssl = "https" == uri.scheme
fibs = 10.times.map do
  Fiber.start do
    cur = Fiber.current.object_id
    # XXX getaddrinfo() and connect() are blocking
    # XXX resolv/replace + connect_nonblock
    Net::HTTP.start(uri.host, uri.port, use_ssl: use_ssl) do |http|
      req = Net::HTTP::Get.new(uri)
      http.request(req) do |res|
        dig = Digest::SHA1.new
        res.read_body do |buf|
          dig.update(buf)
          # warn "#{cur} #{buf.bytesize}\n"
        end
        warn "#{cur} #{dig.hexdigest}\n"
      end
    end
    warn "done\n"
    :done
  end
end

warn "joining #{Time.now}\n"
fibs[-1].join(4)
warn "joined #{Time.now}\n"
all = fibs.dup

warn "1 joined, wait for the rest\n"
until fibs.empty?
  fibs.each(&:join)
  fibs.keep_if(&:alive?)
  warn fibs.inspect
end

p all.map(&:value)

Fiber.new do
  puts 'HI'
end.run.join
```


---Files--------------------------------
0001-auto-fiber-schedule-for-rb_wait_for_single_fd-and-rb.patch (82.8 KB)


-- 
https://bugs.ruby-lang.org/
