[Radiance-general] "Broken pipe" message from rpiece on multi-core Linux system

Jack de Valpine jedev at visarc.com
Wed Apr 11 05:54:58 PDT 2012


Hi Randolph,

All I have is Linux. I am not sure which kernels at this point, but I 
have noticed this over multiple kernels and distributions, although I 
have not run anything on the most recent kernels.

I know that one thing I did was to disable the fork-and-wait 
functionality in rpiece (where it waits for a job to finish). I do not 
recall though whether this was related to this problem, NFS locking, or 
running on a cluster with job-distribution queuing...? Sorry I do not 
remember more right now.

Just thinking out loud here, but if you are running on a cluster then 
could network latency also be an issue?

Here is my suspicion/theory, which I have not been able to test: I think 
there is somehow a race condition between the way jobs get forked off 
and the way the status of pieces gets recorded in the syncfile...
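
To make that theory concrete, the cycle in question looks roughly like 
the sketch below. This is an illustrative C fragment only, written on 
the assumption that rpiece uses POSIX advisory locking on the syncfile; 
the function name and details are hypothetical, not the actual rpiece 
source. The point is that if the lock is not honored (for example, by a 
flaky NFS lock daemon), two processes can claim the same piece or 
clobber each other's status records:

    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical sketch of one claim cycle on the syncfile. */
    static int
    claim_next_piece(int syncfd, char *buf, size_t len)
    {
        struct flock fl;
        ssize_t n;

        fl.l_type = F_WRLCK;            /* exclusive write lock */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;                   /* zero length == whole file */
        if (fcntl(syncfd, F_SETLKW, &fl) < 0)
            return -1;                  /* cannot lock -- unsafe to go on */

        /* critical section: read the piece assignments, pick the next
         * unclaimed piece, and rewrite the file with our claim added */
        lseek(syncfd, 0, SEEK_SET);
        n = read(syncfd, buf, len);     /* current assignments */
        /* ... parse the n bytes, choose a piece, write claim back ... */

        fl.l_type = F_UNLCK;            /* release before rendering */
        fcntl(syncfd, F_SETLK, &fl);
        return 0;
    }

If either the fork timing or the NFS lock manager lets two processes 
through that critical section at once, the syncfile contents can end up 
inconsistent, which would match the symptoms described here.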

For testing/debugging purposes, a few things to look at and compare 
might be:

  * big scene - slow load time
  * small scene - fast load time
  * "fast" parameters - small image size with lots of divisions
  * "slow" parameters - small image size with lots of divisions

On my cluster, I ended up setting things up so that any initial small 
image run for building the ambient cache would just run as a single 
rpict process, and then large images would get distributed across 
nodes/cores.

As an aside, perhaps Rob G. has some thoughts on Radiance/clusters, as I 
think they have a large one also. What is the cluster setup at LBNL? I 
believe that at one point they were using a provisioning system called 
Warewulf, which has since evolved into Perceus. I have the former set up 
and have not gotten around to the latter. LBNL may also be using a job 
queuing system called Slurm, which they developed (or maybe that was at 
LLNL)?

Hopefully this is not leading you off on the wrong track, though. It 
would probably be useful to figure out whether the problem is indeed 
rpiece related or something else entirely.

-Jack

-- 
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction


On 4/11/2012 1:27 AM, Randolph M. Fritz wrote:
>
> Thanks Jack, Greg.
>
>
> Jack, what kernel were you using? Was it also Linux?
>
>
> Greg, I was using rad, so those delays are already in there, alas. I 
> wonder if there is some subtle difference between the Mac OS Mach 
> kernel and the Linux kernel that's causing the problem, or if it 
> occurs on all platforms, just more frequently in the very fast cluster 
> nodes.
>
>
> Or, it could be an NFS locking problem, bah.
>
>
> If I find time, maybe I can dig into it some more. Right now, I may 
> just finesse it by running multiple *different* simulations on the 
> same cluster node.
>
>
> Randolph
>
>
> On 2012-04-09 21:52:47 +0000, Greg Ward said:
>
>
> If it is a startup issue as Jack suggests, you might try inserting a 
> few seconds of delay between the spawning of each new rpiece process 
> using "sleep 5" or similar.  This allows time for the sync file to be 
> updated without contention between processes.  This is what I do in 
> rad with the -N option.  I actually wait 10 seconds between each new 
> rpiece process.
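
For reference, the stagger Greg describes can be sketched in C as 
below. This is an illustrative fragment, not the actual rad source, and 
the rpiece arguments and file names are placeholders:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROC   8           /* rpiece processes to run */
    #define STAGGER 10          /* seconds between spawns, as rad -N does */

    int
    main(void)
    {
        /* placeholder arguments -- substitute the real rpiece command */
        char *args[] = {"rpiece", "-F", "sync.txt",
                        "-o", "out.unf", "scene.oct", NULL};
        int i;

        for (i = 0; i < NPROC; i++) {
            switch (fork()) {
            case -1:
                perror("fork");
                exit(1);
            case 0:             /* child: become an rpiece process */
                execvp(args[0], args);
                perror("execvp");
                _exit(127);
            }
            sleep(STAGGER);     /* let the sync file settle first */
        }
        while (wait(NULL) > 0)  /* parent reaps all the children */
            ;
        return 0;
    }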
>
>
> This isn't to say that I understand the source of your error, which 
> still puzzles me.
>
>
> -Greg
>
> From: Jack de Valpine <jedev at visarc.com>
>
> Date: April 9, 2012 1:46:03 PM PDT
>
>
> Hey Randolph,
>
>
> I have run into this before. Unfortunately I have had limited success 
> in tracking down the issue and also have not really looked at it for 
> some time. If I recall correctly, a couple of things that I have noticed:
>
>   * possible problem if a piece finishes before the first set of
>     pieces is parcelled out by rpiece - so if 8 pieces are
>     being distributed at startup and piece 2 (for example) finishes
>     before one of pieces 1, 3, 4, 5, 6, 7, or 8 has even been processed
>     by rpiece, or while rpiece is still forking off the initial jobs.
>
> Sorry I cannot offer more; I have spent some time in the code on this 
> one, and it is not for the faint of heart, to say the least.
>
> -Jack
>
> --
> # Jack de Valpine
> # president
> #
> # visarc incorporated
> # http://www.visarc.com
> #
> # channeling technology for superior design and construction
>
>
> On 4/9/2012 3:29 PM, Randolph M. Fritz wrote:
>
> This problem is back for a sequel, and it would really help my work if 
> I could get it going.
>
>
> It's been a few months since I last asked about this.  Has anyone else 
> experienced this in a Linux environment?  Anyone have any ideas what 
> to do about it or how to debug it?
>
>
> /proc/version reports:
>
>  Linux version 2.6.18-274.18.1.el5 (mockbuild at builder10.centos.org) 
> (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Thu Feb 9 
> 12:45:44 EST 2012
>
>
> Randolph
>
>
> On 2011-07-08 01:13:01 +0000, Randolph M. Fritz said:
>
>
> On 2011-07-07 16:54:06 -0700, Greg Ward said:
>
>
> Hi Randolph,
>
>
> This shouldn't happen, unless one of the rpict processes died
> unexpectedly.  Even then, I would expect some other kind of error to be
> reported as well.
>
>
> -Greg
>
>
> Thanks, Greg.  I think that's what happened; in fact seven of the
> eight died in two cases.  Weirdly, the third succeeded.  If I run it as
> a single-processor job, it works.  Here's a piece of the log:
>
>
> rpiece -F bl_blinds_rpsync.txt -PP pfLF5M90 -vtv -vp 60.0 -2.0 66.0 -vd
> 12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -dp 512 -ar 42 -ms 3.6
> -ds .3 -dt .1 -dc .5 -dr 1 -ss 1 -st .1 -af bl.amb -aa .1 -ad 1536 -as
> 392 -av 10 10 10 -lr 8 -lw 1e-4 -ps 6 -pt .08 -o bl_blinds.unf bl.oct
>
> rpict: warning - no output produced
> rpict: system - write error in io_process: Broken pipe
> rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on n0065.lr1
> rad: error rendering view blinds
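
The "Broken pipe" in that log is the generic POSIX EPIPE condition: a 
write to a pipe whose reading end has already closed, which fits Greg's 
guess that one of the rpict processes died unexpectedly. A minimal 
standalone demonstration of the same failure mode (plain C, not 
Radiance code):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd[2];

        signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of dying */
        pipe(fd);
        close(fd[0]);               /* simulate the reader going away */
        if (write(fd[1], "data", 4) < 0)
            fprintf(stderr, "write error: %s\n", strerror(errno));
        return 0;
    }

Run as-is, this prints "write error: Broken pipe" -- the same message 
strerror produces in the rpict log above.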
>
> --
> Randolph M. Fritz
>
> _______________________________________________
> Radiance-general mailing list
> Radiance-general at radiance-online.org
> http://www.radiance-online.org/mailman/listinfo/radiance-general

