[Radiance-general] "Broken pipe" message from rpiece on multi-core Linux system
Jack de Valpine
jedev at visarc.com
Wed Apr 11 10:22:52 PDT 2012
Hey Andy,
This jogs my memory a bit. Perhaps this is a different topic at this
point - not sure, as it is more about clusters, Radiance, and rpiece.
Another problem I encountered with stock rpiece on my cluster was that
tiles/pieces would sometimes get written into the image in the wrong
place. My solution, if I remember correctly, was to customize rpiece so
that each running instance wrote its pieces to its own image file.
These would all then get assembled as a post process. The idea was to
still take advantage of the functionality that rpiece does offer.
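
For illustration, here is a minimal sketch of that post-process
assembly step, assuming each instance wrote row-major tiles named
tile_000.hdr, tile_001.hdr, ... with tile 0 at the lower left, each
pxres x pyres pixels (the names and layout are hypothetical, not from
my actual script):

### assemble_tiles.bsh (hypothetical sketch) #####

#!/bin/bash
# paste the per-instance tile pictures back into one image with pcompos;
# pcompos anchors each picture at an x y offset from the lower left
numcols=8; numrows=16   # tile grid
pxres=64; pyres=32      # pixels per tile
args=()
for (( i = 0; i < numcols * numrows; i++ )); do
    col=$(( i % numcols ))
    row=$(( i / numcols ))
    args+=( tile_$(printf "%03d" $i).hdr $(( col * pxres )) $(( row * pyres )) )
done
pcompos "${args[@]}" > assembled.hdr
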
-Jack
--
# Jack de Valpine
# president
#
# visarc incorporated
# http://www.visarc.com
#
# channeling technology for superior design and construction
On 4/11/2012 1:04 PM, Andy McNeil wrote:
> Hi Randolph,
>
> For what it's worth, I don't use rpiece when I render on the cluster.
> I have a script that takes a view file, a tile number, and the number
> of rows and columns, and renders the assigned tile (run_render.csh).
> In the job submit script I distribute these tile rendering tasks to
> multiple cores on multiple nodes. I can't use the ambient cache with
> this method, but I typically use rtcontrib, so I wouldn't be able to
> use it regardless. There is also the problem that some processors sit
> idle after they've finished their tile while other processes are still
> running, but I don't worry about it because computing time on
> Lawrencium is cheap and available.
>
> Snippets from my scripts are below.
>
> Andy
>
>
>
> ### job_submit.bsh #####
>
> #!/bin/bash
> # specify the queue: lr_debug, lr_batch
> #PBS -q lr_batch
> #PBS -A ac_rad71t
> #PBS -l nodes=16:ppn=8:lr1
> #PBS -l walltime=24:00:00
> #PBS -m be
> #PBS -M amcneil at lbl.gov
> #PBS -e run_v4a.err
> #PBS -o run_v4a.out
>
> # change to working directory & run the program
> cd ~/models/wwr60
>
> # dispatch one tile-rendering task per core (16 nodes x 8 cores = 128 tiles)
> for i in {0..127}; do
> pbsdsh -n $i $PBS_O_WORKDIR/run_render.csh views/v4a.vf $(printf "%03d" $i) 8 16 &
> done
>
> wait
>
>
>
>
> ### run_render.csh ######
> #! /bin/csh
>
> cd $PBS_O_WORKDIR
> set path=($path ~/applications/Radiance/bin/ )
>
> set oxres = 512
> set oyres = 512
>
> set view = $argv[1]
> set thispiece = $argv[2]
> set numcols = $argv[3]
> set numrows = $argv[4]
> set numpieces = `ev "$numcols * $numrows"`
>
> # per-tile resolution: the full picture dimensions divided by the grid
> set pxres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($2/'$numcols'+.5)}'`
> set pyres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($4/'$numrows'+.5)}'`
>
> # view type letter from -vtX, plus this tile's view shift and lift
> set vtype = `awk '{for(i=1;i<NF;i++) if(match($i,"-vt")==1) split($i,vt,"")} END { print vt[4] }' $view`
> set vshift = `ev "$thispiece - $numcols * floor( $thispiece / $numcols ) - $numcols / 2 + .5"`
> set vlift = `ev "floor( $thispiece / $numcols ) - $numrows / 2 + .5"`
>
> # for a perspective (-vtv) view, compute the reduced -vh/-vv angles so
> # that each tile covers its share of the full view frustum
> if ($vtype == "v") then
> set vhoriz = `awk 'BEGIN{PI=3.14159265} \
> {for(i=1;i<NF;i++) if($i=="-vh") vh=$(i+1)*PI/180 } \
> END{print atan2(sin(vh/2)/'$numcols',cos(vh/2))*180/PI*2}' $view`
> set vvert = `awk 'BEGIN{PI=3.14159265} \
> {for(i=1;i<NF;i++) if($i=="-vv") vv=$(i+1)*PI/180 } \
> END{print atan2(sin(vv/2)/'$numrows',cos(vv/2))*180/PI*2}' $view`
> endif
>
> # generate this tile's rays with vwrays and pipe them to rtcontrib,
> # which accumulates the window (Klems-bin) contributions
> vwrays -ff -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres \
> | rtcontrib -n 1 `vwrays -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres -d` \
> -ffc -fo \
> -o binpics/wwr60/${view:t:r}/${view:t:r}_wwr60_%s_%04d_${thispiece}.hdr \
> -f klems_horiz.cal -bn Nkbins \
> -b 'kbin(0,1,0,0,0,1)' -m GlDay -b 'kbin(0,1,0,0,0,1)' -m GlView \
> -w -ab 6 -ad 6000 -lw 1e-7 -ds .07 -dc 1 oct/vmx.oct
>
> On Apr 11, 2012, at 5:54 AM, Jack de Valpine wrote:
>
>> Hi Randolph,
>>
>> All I have is Linux. I am not sure which kernels at this point, but I
>> have noticed this across multiple kernels and distributions, although
>> I have not run anything on the most recent kernels.
>>
>> I know that one thing I did was to disable the fork-and-wait
>> functionality that rpiece uses to wait for a job to finish. I do not
>> recall, though, whether that was related to this problem, to NFS
>> locking, or to running on a cluster with job distribution queuing.
>> Sorry I do not remember more right now.
>>
>> Just thinking out loud here, but if you are running on a cluster,
>> could network latency also be an issue?
>>
>> Here is my suspicion/theory, which I have not been able to test: I
>> think that somehow there is a race condition between the way jobs get
>> forked off and the way the status of pieces gets recorded in the
>> syncfile...
>>
>> For testing/debugging purposes, a few things to compare might be:
>>
>> * big scene - slow load time
>> * small scene - fast load time
>> * "fast" parameters - small image size with lots of divisions
>> * "slow" parameters - large image size with fewer divisions
>>
>> On my cluster, I ended up setting things up so that any initial
>> small-image run for building the ambient cache would run as a single
>> rpict process, and then the large images would get distributed across
>> nodes/cores.
>>
>> As an aside, perhaps Rob G. has some thoughts on Radiance/clusters,
>> as I think they have a large one also. What is the cluster setup at
>> LBNL? I believe that at one point they were using a provisioning
>> system called Warewulf, which has since evolved into Perceus. I have
>> the former set up and have not gotten around to the latter. LBNL may
>> also be using a job queuing system called Slurm, which they developed
>> (or maybe that was LLNL)?
>>
>> Hopefully this is not leading you off on the wrong track, though. It
>> would probably be useful to figure out whether the problem is indeed
>> rpiece-related or something else entirely.
>>
>> -Jack
>> --
>> # Jack de Valpine
>> # president
>> #
>> # visarc incorporated
>> # http://www.visarc.com
>> #
>> # channeling technology for superior design and construction
>>
>> On 4/11/2012 1:27 AM, Randolph M. Fritz wrote:
>>>
>>> Thanks Jack, Greg.
>>>
>>> Jack, what kernel were you using? Was it also Linux?
>>>
>>> Greg, I was using rad, so those delays are already in there, alas. I
>>> wonder if there is some subtle difference between the Mac OS Mach
>>> kernel and the Linux kernel that's causing the problem, or if it
>>> occurs on all platforms, just more frequently on the very fast
>>> cluster nodes.
>>>
>>> Or, it could be an NFS locking problem, bah.
>>>
>>> If I find time, maybe I can dig into it some more. Right now, I may
>>> just finesse it by running multiple *different* simulations on the
>>> same cluster node.
>>>
>>> Randolph
>>>
>>> On 2012-04-09 21:52:47 +0000, Greg Ward said:
>>>
>>> If it is a startup issue as Jack suggests, you might try inserting a
>>> few seconds of delay between the spawning of each new rpiece process
>>> using "sleep 5" or similar. This allows time for the sync file to be
>>> updated without contention between processes. This is what I do in
>>> rad with the -N option. I actually wait 10 seconds between each new
>>> rpiece process.
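>>>
>>> For illustration, a minimal sketch of that staggered start-up; the
>>> scene, sync file, and option names are placeholders:
>>>
>>> #!/bin/bash
>>> # launch 8 cooperating rpiece processes, sleeping between forks so
>>> # each one can create or update the sync file without contention
>>> opts="-ab 2 -ad 1024 -aa .15"
>>> for i in {1..8}; do
>>>     rpiece $opts -F scene_rpsync.txt -X 4 -Y 4 -x 1024 -y 1024 \
>>>         -o scene.unf scene.oct &
>>>     sleep 10
>>> done
>>> wait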
>>>
>>> This isn't to say that I understand the source of your error, which
>>> still puzzles me.
>>>
>>> -Greg
>>>
>>> From: Jack de Valpine <jedev at visarc.com>
>>> Date: April 9, 2012 1:46:03 PM PDT
>>>
>>> Hey Randolph,
>>>
>>> I have run into this before. Unfortunately I have had limited success
>>> in tracking down the issue and also have not really looked at it for
>>> some time. If I recall correctly, a couple of things that I have
>>> noticed:
>>>
>>> * a possible problem if a piece finishes before the first set of
>>> pieces has been parcelled out by rpiece - so if 8 pieces are being
>>> distributed at startup and piece 2 (for example) finishes before one
>>> of pieces 1, 3, 4, 5, 6, 7, 8 has even been processed by rpiece, or
>>> while rpiece is still forking off the initial jobs.
>>>
>>> Sorry I cannot offer more. I have spent some time in the code on this
>>> one, and it is not for the faint of heart, to say the least.
>>>
>>> -Jack
>>>
>>> --
>>> # Jack de Valpine
>>> # president
>>> #
>>> # visarc incorporated
>>> # http://www.visarc.com
>>> #
>>> # channeling technology for superior design and construction
>>>
>>> On 4/9/2012 3:29 PM, Randolph M. Fritz wrote:
>>>
>>> This problem is back for a sequel, and it would really help my work
>>> if I could get it going.
>>>
>>> It's been a few months since I last asked about this. Has anyone else
>>> experienced this in a Linux environment? Does anyone have any ideas
>>> what to do about it or how to debug it?
>>>
>>> /proc/version reports:
>>>
>>> Linux version 2.6.18-274.18.1.el5 (mockbuild at builder10.centos.org)
>>> (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Thu Feb 9
>>> 12:45:44 EST 2012
>>>
>>> Randolph
>>>
>>> On 2011-07-08 01:13:01 +0000, Randolph M. Fritz said:
>>>
>>> On 2011-07-07 16:54:06 -0700, Greg Ward said:
>>>
>>> Hi Randolph,
>>>
>>> This shouldn't happen, unless one of the rpict processes died
>>> unexpectedly. Even then, I would expect some other kind of error to
>>> be reported as well.
>>>
>>> -Greg
>>>
>>> Thanks, Greg. I think that's what happened; in fact, seven of the
>>> eight died in two cases. Weirdly, the third succeeded. If I run it
>>> as a single-processor job, it works. Here's a piece of the log:
>>>
>>> rpiece -F bl_blinds_rpsync.txt -PP pfLF5M90 -vtv -vp 60.0 -2.0 66.0
>>> -vd 12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -dp 512 -ar 42
>>> -ms 3.6 -ds .3 -dt .1 -dc .5 -dr 1 -ss 1 -st .1 -af bl.amb -aa .1
>>> -ad 1536 -as 392 -av 10 10 10 -lr 8 -lw 1e-4 -ps 6 -pt .08
>>> -o bl_blinds.unf bl.oct
>>>
>>> rpict: warning - no output produced
>>> rpict: system - write error in io_process: Broken pipe
>>> rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on n0065.lr1
>>> rad: error rendering view blinds
>>>
>>> --
>>> Randolph M. Fritz