[Radiance-general] "Broken pipe" message from rpiece on multi-core Linux system

Jack de Valpine jedev at visarc.com
Wed Apr 11 10:22:52 PDT 2012


Hey Andy,

This jogs my memory a bit. Perhaps this is a different topic at this
point, since it is more about clusters, Radiance, and rpiece. Another
problem I encountered with stock rpiece on my cluster was that
tiles/pieces would sometimes get written into the image in the wrong
place. My solution, if I remember correctly, was to customize rpiece so
that each running instance wrote its pieces to its own image file;
these would then all get assembled as a post process. The idea was to
still take advantage of the functionality that rpiece does offer.
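
For illustration, a minimal sketch of such a post-process assembly with
pcompos (the tile names, the 2x2 grid, and the 512-pixel tile size are
made up for the example, not my actual setup):

    # place each 512x512 tile at its offset in a 1024x1024 output picture;
    # pcompos anchors every input at the given x y from the lower-left corner
    pcompos -x 1024 -y 1024 \
        tile_r0_c0.hdr   0   0    tile_r0_c1.hdr 512   0 \
        tile_r1_c0.hdr   0 512    tile_r1_c1.hdr 512 512 \
        > assembled.hdr

In a loop, the offsets are just column*tilewidth and row*tileheight,
counting rows from the bottom.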

-Jack

-- 
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction


On 4/11/2012 1:04 PM, Andy McNeil wrote:
> Hi Randolph,
>
> For what it's worth, I don't use rpiece when I render on the cluster.
> I have a script that takes a view file, a tile number, and the number
> of rows and columns, and renders the assigned tile (run_render.csh).
> In the job submit script I distribute these tile rendering tasks to
> multiple cores on multiple nodes.  I can't use the ambient cache with
> this method, but I typically use rtcontrib, so I wouldn't be able to
> use it anyway.  There is also the problem that some processors sit
> idle after they've finished their tile while other processes are still
> running, but I don't worry about it because computing time on
> Lawrencium is cheap and available.
>
> Snippets from my scripts are below.
>
> Andy
>
>
>
> ### job_submitt.bsh #####
>
>  #!/bin/bash
>  #    specify the queue: lr_debug, lr_batch
>  #PBS -q lr_batch
>  #PBS -A ac_rad71t
>  #PBS -l nodes=16:ppn=8:lr1
>  #PBS -l walltime=24:00:00
>  #PBS -m be
>  #PBS -M amcneil at lbl.gov
>  #PBS -e run_v4a.err
>  #PBS -o run_v4a.out
>
> #   change to working directory & run the program
> cd ~/models/wwr60
>
> for i in {0..127}; do
>     pbsdsh -n $i $PBS_O_WORKDIR/run_render.csh views/v4a.vf $(printf "%03d" $i) 8 16 &
> done
>
> wait
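>
> (For reference: nodes=16:ppn=8 above provides 128 cores, which matches
> the 128 tile tasks launched by the {0..127} loop, and the trailing
> "8 16" arguments tell run_render.csh to split the view into 8 columns
> by 16 rows.)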
>
>
>
>
> ### run_render.csh ######
> #! /bin/csh
>
> cd $PBS_O_WORKDIR
> set path=($path ~/applications/Radiance/bin/ )
>
> set oxres = 512
> set oyres = 512
>
> set view = $argv[1]
> set thispiece = $argv[2]
> set numcols = $argv[3]
> set numrows = $argv[4]
> set numpieces = `ev "$numcols * $numrows"`
>
> set pxres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($2/'$numcols'+.5)}'`
> set pyres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($4/'$numrows'+.5)}'`
>
> set vtype = `awk '{for(i=1;i<NF;i++) if(match($i,"-vt")==1) split($i,vt,"")} END { print vt[4] }' $view`
> set vshift = `ev "$thispiece - $numcols * floor( $thispiece / $numcols) - $numcols / 2 + .5"`
> set vlift = `ev "floor( $thispiece / $numcols ) - $numrows / 2 + .5"`
>
> if ($vtype == "v") then
> set vhoriz = `awk 'BEGIN{PI=3.14159265} \
> {for(i=1;i<NF;i++) if($i=="-vh") vh=$(i+1)*PI/180 } \
> END{print atan2(sin(vh/2)/'$numcols',cos(vh/2))*180/PI*2}' $view`
> set vvert = `awk 'BEGIN{PI=3.14159265} \
> {for(i=1;i<NF;i++) if($i=="-vv") vv=$(i+1)*PI/180 } \
> END{print atan2(sin(vv/2)/'$numrows',cos(vv/2))*180/PI*2}' $view`
> endif
>
> vwrays -ff -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres \
> | rtcontrib -n 1 `vwrays -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres -d` \
> -ffc -fo \
> -o binpics/wwr60/${view:t:r}/${view:t:r}_wwr60_%s_%04d_${thispiece}.hdr \
> -f klems_horiz.cal -bn Nkbins \
> -b 'kbin(0,1,0,0,0,1)' -m GlDay -b 'kbin(0,1,0,0,0,1)' -m GlView \
> -w -ab 6 -ad 6000 -lw 1e-7 -ds .07 -dc 1 oct/vmx.oct
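>
> To make the view-splitting arithmetic above concrete (the numbers here
> are made up for illustration): with numcols=8 and numrows=16, tile 10
> falls in row floor(10/8) = 1 and column 10 - 8*1 = 2, so
>
>     vshift = 10 - 8*1 - 8/2 + .5 = -1.5
>     vlift  = floor(10/8) - 16/2 + .5 = -6.5
>
> Since -vh and -vv are reduced so that each tile covers 1/numcols and
> 1/numrows of the original image plane, -vs and -vl are in whole-tile
> units, i.e. this tile's window sits 1.5 tile widths and 6.5 tile
> heights from the center of the full view.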
>
>
>
>
>
>
>
> On Apr 11, 2012, at 5:54 AM, Jack de Valpine wrote:
>
>> Hi Randolph,
>>
>> All I have is Linux. I'm not sure which kernels at this point, but I
>> have noticed this across multiple kernels and distributions, although
>> I have not run anything on the most recent kernels.
>>
>> I know that one thing I did was to disable the fork-and-wait
>> functionality in rpiece, where it waits for a job to finish. I do not
>> recall, though, whether this was related to this problem, to NFS
>> locking, or to running on a cluster with a job distribution queue...?
>> Sorry, I do not remember more right now.
>>
>> Just thinking out loud here, but if you are running on a cluster then 
>> could network latency also be an issue?
>>
>> Here is my suspicion/theory, which I have not been able to test. I 
>> think that somehow there is a race condition in the way jobs get 
>> forked off and status of pieces gets recorded in the syncfile...
>>
>> For testing/debugging purposes, a few things to look at and compare might be:
>>
>>   * big scene - slow load time
>>   * small scene - fast load time
>>   * "fast" parameters - small image size with lots of divisions
>>   * "slow" parameters - large image size with fewer divisions
>>
>> On my cluster, I ended up setting things up so that any initial small
>> image run for building the ambient cache would just run as a single
>> rpict process, and then the large images would get distributed across
>> nodes/cores, roughly as sketched below.
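>>
>> A rough sketch of that two-stage workflow (the file names and
>> parameters below are placeholders, not the actual setup):
>>
>>   # stage 1: small single-process overture whose only purpose is to
>>   # populate the shared ambient file; the low-res picture is discarded
>>   rpict -vf view.vf -x 128 -y 128 -af scene.amb -aa .15 -ad 1024 scene.oct > /dev/null
>>
>>   # stage 2: full-size render reusing scene.amb; rad -N runs several
>>   # rpiece processes on one node (scene.rif is assumed to set
>>   # AMBFILE= scene.amb), while cross-node distribution goes through
>>   # the cluster's own job launcher
>>   rad -N 8 scene.rif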
>>
>> As an aside, perhaps Rob G. has some thoughts on Radiance and
>> clusters, as I think they have a large one also. What is the cluster
>> setup at LBNL? I believe that at one point they were using a
>> provisioning system called Warewulf, which has since evolved into
>> Perceus. I have the former set up and have not gotten around to the
>> latter. LBNL may also be using a job queuing system called Slurm,
>> which they developed (or maybe that was at LLNL)?
>>
>> Hopefully this is not leading you off on the wrong track, though. It
>> would probably be useful to figure out whether the problem is indeed
>> rpiece-related or something else entirely.
>>
>> -Jack
>> -- 
>> # Jack de Valpine
>> # president
>> #
>> # visarc incorporated
>> # http://www.visarc.com
>> #
>> # channeling technology for superior design and construction
>>
>> On 4/11/2012 1:27 AM, Randolph M. Fritz wrote:
>>>
>>> Thanks Jack, Greg.
>>>
>>>
>>> Jack, what kernel were you using? Was it also Linux?
>>>
>>>
>>> Greg, I was using rad, so those delays are already in there, alas. I
>>> wonder if there is some subtle difference between the Mac OS Mach
>>> kernel and the Linux kernel that's causing the problem, or if it
>>> occurs on all platforms, just more frequently on the very fast
>>> cluster nodes.
>>>
>>>
>>> Or, it could be an NFS locking problem, bah.
>>>
>>>
>>> If I find time, maybe I can dig into it some more. Right now, I may
>>> just finesse it by running multiple *different* simulations on the
>>> same cluster node.
>>>
>>>
>>> Randolph
>>>
>>>
>>> On 2012-04-09 21:52:47 +0000, Greg Ward said:
>>>
>>>
>>> If it is a startup issue as Jack suggests, you might try inserting a 
>>> few seconds of delay between the spawning of each new rpiece process 
>>> using "sleep 5" or similar.  This allows time for the sync file to 
>>> be updated without contention between processes.  This is what I do 
>>> in rad with the -N option.  I actually wait 10 seconds between each 
>>> new rpiece process.
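>>>
>>> For illustration, staggering the launches by hand might look
>>> something like this sketch (the sync file, view, scene, and output
>>> names are placeholders):
>>>
>>>   for i in 1 2 3 4 5 6 7 8; do
>>>       rpiece -F scene_rpsync.txt -X 8 -Y 8 -vf view.vf -x 4096 -y 4096 \
>>>           -af scene.amb -o scene.unf scene.oct &
>>>       sleep 10   # let each new process record itself in the sync file first
>>>   done
>>>   wait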
>>>
>>>
>>> This isn't to say that I understand the source of your error, which 
>>> still puzzles me.
>>>
>>>
>>> -Greg
>>>
>>> From: Jack de Valpine <jedev at visarc.com>
>>>
>>> Date: April 9, 2012 1:46:03 PM PDT
>>>
>>>
>>> Hey Randolph,
>>>
>>>
>>> I have run into this before. Unfortunately I have had limited 
>>> success in tracking down the issue and also have not really looked 
>>> at it for some time. If I recall correctly, a couple of things that 
>>> I have noticed:
>>>
>>>   * possible problem if a piece finishes before the first set of
>>>     pieces has been parcelled out by rpiece - so if 8 pieces are
>>>     being distributed at startup and piece 2 (for example) finishes
>>>     before one of pieces 1, 3, 4, 5, 6, 7, 8 has even been processed
>>>     by rpiece, or while rpiece is still forking off the initial jobs.
>>>
>>> Sorry I cannot offer more; I have spent some time in the code on
>>> this one, and it is not for the faint of heart, to say the least.
>>>
>>> -Jack
>>>
>>> --
>>> # Jack de Valpine
>>> # president
>>> #
>>> # visarc incorporated
>>> # http://www.visarc.com
>>> #
>>> # channeling technology for superior design and construction
>>>
>>>
>>> On 4/9/2012 3:29 PM, Randolph M. Fritz wrote:
>>>
>>> This problem is back for a sequel, and it would really help my work 
>>> if I could get it going.
>>>
>>>
>>> It's been a few months since I last asked about this.  Has anyone 
>>> else experienced this in a Linux environment?  Anyone have any ideas 
>>> what to do about it or how to debug it?
>>>
>>>
>>> /proc/version reports:
>>>
>>>  Linux version 2.6.18-274.18.1.el5 (mockbuild at builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Thu Feb 9 12:45:44 EST 2012
>>>
>>>
>>> Randolph
>>>
>>>
>>> On 2011-07-08 01:13:01 +0000, Randolph M. Fritz said:
>>>
>>> On 2011-07-07 16:54:06 -0700, Greg Ward said:
>>>
>>> Hi Randolph,
>>>
>>> This shouldn't happen, unless one of the rpict processes died
>>> unexpectedly.  Even then, I would expect some other kind of error to
>>> be reported as well.
>>>
>>> -Greg
>>>
>>> Thanks, Greg.  I think that's what happened; in fact seven of the
>>> eight died in two cases.  Weirdly, the third succeeded.  If I run it
>>> as a single-processor job, it works.  Here's a piece of the log:
>>>
>>> rpiece -F bl_blinds_rpsync.txt -PP pfLF5M90 -vtv -vp 60.0 -2.0 66.0 -vd 12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -dp 512 -ar 42 -ms 3.6 -ds .3 -dt .1 -dc .5 -dr 1 -ss 1 -st .1 -af bl.amb -aa .1 -ad 1536 -as 392 -av 10 10 10 -lr 8 -lw 1e-4 -ps 6 -pt .08 -o bl_blinds.unf bl.oct
>>>
>>> rpict: warning - no output produced
>>> rpict: system - write error in io_process: Broken pipe
>>> rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on n0065.lr1
>>> rad: error rendering view blinds
>>>
>>> --
>>> Randolph M. Fritz
>>>
>
>
>
> _______________________________________________
> Radiance-general mailing list
> Radiance-general at radiance-online.org
> http://www.radiance-online.org/mailman/listinfo/radiance-general

