On Sun, 26 Jan 2003, Joel VanderWerf wrote:

> Any solaris gurus out there?

any chance your home directory, or wherever you are running this, is NFS mounted?

-a

>
> I'm having trouble porting some multi-thread, multi-process code from
> linux to solaris. I've already dealt with (or tried to deal with) some
> differences in flock (solaris flock is based on fcntl locks), like the
> fact that closing a file releases locks on the file held by other threads.
>
> I've managed to isolate the problem in a fairly simple test program. It's at
>
>    http://path.berkeley.edu/~vjoel/ruby/solaris-bug.rb
>
> The program creates /tmp/test-file-lock.dat, which holds a marshalled
> fixnum starting at 0. Then it creates Np processes each with Nt threads
> which do a random sequence of reads and writes using some locking
> methods. The writes just increment the counter.
>
> When a process is done, it writes the number of times it incremented the
> counter to the file /tmp/test-file-lock.dat#{pid}. Then the main process
> adds these up and compares with the contents of the counter file. The
> point of this is to test for colliding writers.
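
For anyone following along, the final check described above could be sketched roughly like this. It is only an illustration: `tally_matches?` is a made-up name, and it assumes the per-process files hold a marshalled integer, which the post does not actually say.

```ruby
# Hypothetical helper, not taken from solaris-bug.rb.
# counter_path: the shared counter file (a marshalled integer).
# pids: the child process ids; each child is assumed to have written its
# own increment count, marshalled, to "#{counter_path}#{pid}".
def tally_matches?(counter_path, pids)
  expected = pids.sum do |pid|
    Marshal.load(File.binread("#{counter_path}#{pid}"))
  end
  actual = Marshal.load(File.binread(counter_path))
  # Any lost update (two writers colliding) makes actual < expected.
  expected == actual
end
```
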
>
> But the program fails before that final test--it seems to be having a
> collision between a reader and a writer that causes the reader to see a
> corrupt file.
>
> A typical run fails like this. The counter 0..3 is a seconds clock:
>
>    $ ruby solaris-bug.rb
>    0
>    1
>    2
>    3
>    solaris-bug.rb:128:in `load': marshal data too short (ArgumentError)
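
That error message is what Marshal reports when it is handed fewer bytes than a complete dump. A small sketch that loads from a String (the real script's read path may differ) to show an empty or half-written counter producing the same message:

```ruby
full = Marshal.dump(0) # version header plus the encoded fixnum

# Try to load an empty string and a dump cut off after the header,
# collecting the error message each attempt raises.
messages = ["", full[0, 2]].map do |partial|
  begin
    Marshal.load(partial)
    "loaded"
  rescue ArgumentError => e
    e.message
  end
end

puts messages
```
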
>
> It looks like there are a reader and a writer accessing the file at the
> same time, and the writer has just truncated the file (line 137) when
> the reader tries to read it.
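
If that is what's happening, one thing worth ruling out is a truncate that can run before the exclusive lock is held (for example, opening with mode "w", which truncates at open time). A minimal sketch of the ordering I would expect to be safe on a local filesystem; the helper names are hypothetical and this is not the code from solaris-bug.rb:

```ruby
# Hypothetical helpers for a marshalled counter file.

# Reader: take a shared lock before reading anything.
def read_counter(path)
  File.open(path, "rb") do |f|
    f.flock(File::LOCK_SH)
    Marshal.load(f.read)
  end
end

# Writer: open WITHOUT truncating, take the exclusive lock, and only then
# rewrite the contents and cut the file to its new length.
def increment_counter(path)
  File.open(path, File::RDWR | File::CREAT, 0644) do |f|
    f.binmode
    f.flock(File::LOCK_EX)
    data = f.read
    value = data.empty? ? 0 : Marshal.load(data)
    f.rewind
    f.write(Marshal.dump(value + 1))
    f.flush
    f.truncate(f.pos) # drop leftover bytes from a longer previous dump
    value + 1
  end # closing the file releases the lock
end
```

Note that each call opens its own file handle and closes it when done; with fcntl-based locks, closing any handle on the file from the same process can drop locks that other threads still rely on, so this sketch does not by itself address the multi-thread case.
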
>
> This happens:
>
>    - on solaris, quad cpu
>      - ruby 1.7.3 (2002-10-30) [sparc-solaris2.7]
>
>    - *not* on single processor linux
>      - ruby 1.7.3 (2002-12-12) [i686-linux]
>
>    - *not* on dual SMP linux
>      - ruby 1.6.7 (2002-03-01) [i686-linux]
>
> Also, the bug requires *both* of:
>
>    - thread_count >= 2
>
>    - process_count >= 2
>
> Also, the bug requires that there be both reader and writer operations
> (i.e., that the random number lead to each branch often enough, say 50/50).
>

-- 

 ====================================
 | Ara Howard
 | NOAA Forecast Systems Laboratory
 | Information and Technology Services
 | Data Systems Group
 | R/FST 325 Broadway
 | Boulder, CO 80305-3328
 | Email: ahoward / fsl.noaa.gov
 | Phone:  303-497-7238
 | Fax:    303-497-7259
 ====================================