[Radiance-general] "Broken pipe" message from rpiece on multi-core Linux system
Randolph M. Fritz
RFritz at lbl.gov
Mon Apr 16 08:32:43 PDT 2012
I've gathered a little more information over Thursday and Friday.
First off, so far I am only using a single node per rendering on our
cluster, and taking advantage of multiple nodes by running more
renderings in parallel. Each node in the Lawrencium cluster has two
six-core Xeons, and they are very capable.
I ran a rendering on a different model in the same cluster environment,
which worked perfectly. My experience so far leads me to the following
conclusion: whatever the problem is, it is dependent on the model and
the simulation parameters. The model where it worked used mkillum; the
two where it fails don't. This may or may not be significant.
I think adding mkillum surfaces to the models that failed would be an
interesting experiment. Maybe I can find some time to do it next week.
Randolph
On 2012-04-11 17:22:52 +0000, Jack de Valpine said:
> Hey Andy,
>
> This jogs my memory a bit. Perhaps this is a different topic at this
> point, not sure, as it is more about clusters, Radiance, and rpiece.
> Another problem I ran into with stock rpiece on my cluster was that
> tiles/pieces would sometimes get written into the image in the wrong
> place. My solution, if I remember correctly, was to customize rpiece so
> that each running instance would write its pieces to its own image file.
> These would all then get assembled as a post-process. The idea behind
> this was still to take advantage of the functionality that rpiece does
> offer.
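> The reassembly afterwards is just image compositing; something along
> these lines with pcompos would do it (a rough, untested sketch, assuming
> the tiles are numbered row by row from the bottom-left corner and are
> all the same pxres x pyres size - file names here are placeholders):
>
> #!/bin/csh
> # paste numbered tile images back into one picture with pcompos
> set numcols = 8
> set numrows = 16
> set pxres = 128
> set pyres = 128
> @ numpieces = $numcols * $numrows
> set args = ()
> @ i = 0
> while ($i < $numpieces)
>     @ col = $i % $numcols
>     @ row = $i / $numcols
>     @ xoff = $col * $pxres
>     @ yoff = $row * $pyres
>     set args = ($args tile_${i}.hdr $xoff $yoff)
>     @ i++
> end
> pcompos $args > assembled.hdr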
>
> -Jack
> --
> # Jack de Valpine
> # president
> #
> # visarc incorporated
> # http://www.visarc.com
> #
> # channeling technology for superior design and construction
>
> On 4/11/2012 1:04 PM, Andy McNeil wrote:
> Hi Randolph,
>
> For what it's worth, I don't use rpiece when I render on the cluster. I
> have a script that takes a view file, a tile number, and the number of
> rows and columns, and renders the assigned tile (run_render.csh). In the
> job submit script I distribute these tile-rendering tasks to multiple
> cores on multiple nodes. I can't use the ambient cache with this method,
> but I typically use rtcontrib so I wouldn't be able to use it regardless.
> There is also the problem that some processors sit idle after they've
> finished their tile while other processes are still running, but I don't
> worry about it because computing time on Lawrencium is cheap and
> available.
>
> Snippets from my scripts are below.
>
> Andy
>
>
>
> ### job_submitt.bsh #####
>
> #!/bin/bash
> # specify the queue: lr_debug, lr_batch
> #PBS -q lr_batch
> #PBS -A ac_rad71t
> #PBS -l nodes=16:ppn=8:lr1
> #PBS -l walltime=24:00:00
> #PBS -m be
> #PBS -M amcneil at lbl.gov
> #PBS -e run_v4a.err
> #PBS -o run_v4a.out
>
> # change to working directory & run the program
> cd ~/models/wwr60
>
> for i in {0..127}; do
> pbsdsh -n $i $PBS_O_WORKDIR/run_render.csh views/v4a.vf $(printf "%03d" $i) 8 16 &
> done
>
> wait
>
>
>
>
> ### run_render.csh ######
> #! /bin/csh
>
> cd $PBS_O_WORKDIR
> set path=($path ~/applications/Radiance/bin/ )
>
> set oxres = 512
> set oyres = 512
>
> set view = $argv[1]
> set thispiece = $argv[2]
> set numcols = $argv[3]
> set numrows = $argv[4]
> set numpieces = `ev "$numcols * $numrows"`
>
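> # per-tile resolution: the full-image dimensions reported by vwrays -d,
> # divided by the number of columns/rows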
> set pxres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($2/'$numcols'+.5)}'`
> set pyres = `vwrays -vf $view -x $oxres -y $oyres -d | awk '{print int($4/'$numrows'+.5)}'`
>
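> # view type letter, plus the -vs/-vl shift and lift that pick out this
> # tile's position within the overall view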
> set vtype = `awk '{for(i=1;i<NF;i++) if(match($i,"-vt")==1) split($i,vt,"")} END { print vt[4] }' $view`
> set vshift = `ev "$thispiece - $numcols * floor( $thispiece / $numcols) - $numcols / 2 + .5"`
> set vlift = `ev "floor( $thispiece / $numcols ) - $numrows / 2 + .5"`
>
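> # for perspective views, shrink -vh/-vv so that one tile covers just its
> # share of the original horizontal and vertical view angles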
> if ($vtype == "v") then
> set vhoriz = `awk 'BEGIN{PI=3.14159265} \
> {for(i=1;i<NF;i++) if($i=="-vh") vh=$(i+1)*PI/180 } \
> END{print atan2(sin(vh/2)/'$numcols',cos(vh/2))*180/PI*2}' $view`
> set vvert = `awk 'BEGIN{PI=3.14159265} \
> {for(i=1;i<NF;i++) if($i=="-vv") vv=$(i+1)*PI/180 } \
> END{print atan2(sin(vv/2)/'$numrows',cos(vv/2))*180/PI*2}' $view`
> endif
>
> vwrays -ff -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres \
> | rtcontrib -n 1 `vwrays -vf $view -vv $vvert -vh $vhoriz -vs $vshift -vl $vlift -x $pxres -y $pyres -d` \
> -ffc -fo \
> -o binpics/wwr60/${view:t:r}/${view:t:r}_wwr60_%s_%04d_${thispiece}.hdr \
> -f klems_horiz.cal -bn Nkbins \
> -b 'kbin(0,1,0,0,0,1)' -m GlDay -b 'kbin(0,1,0,0,0,1)' -m GlView \
> -w -ab 6 -ad 6000 -lw 1e-7 -ds .07 -dc 1 oct/vmx.oct
>
>
>
>
>
>
>
> On Apr 11, 2012, at 5:54 AM, Jack de Valpine wrote:
> Hi Randolph,
>
> All I have is Linux. Not sure what kernels at this point. But I have
> noticed this over multiple kernels and distributions. Although I have
> not run anything on the most recent kernels.
>
> I know that one thing I did was to disable the fork-and-wait
> functionality in rpiece that waits for a job to finish. I do not recall,
> though, whether this was related to this problem, to NFS locking, or to
> running on a cluster with a job distribution queue...? Sorry I do not
> remember more right now.
>
> Just thinking out loud here, but if you are running on a cluster then
> could network latency also be an issue?
>
> Here is my suspicion/theory, which I have not been able to test: I think
> there is somehow a race condition between the way jobs get forked off
> and the way the status of pieces gets recorded in the syncfile...
>
> For testing/debugging purposes, a few things to look at and compare might be:
> • big scene - slow load time
> • small scene - fast load time
> • "fast" parameters - small image size with lots of divisions
> • "slow" parameters - small image size with lots of divisions
> On my cluster, I ended up setting things up so that the initial
> small-image run for building the ambient cache just runs as a single
> rpict process, and the large images then get distributed across
> nodes/cores.
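> In rough outline it looks something like this (a sketch only, with
> placeholder file names and abbreviated rendering options):
>
> #!/bin/csh
> # overture: one small, single-process rpict run whose only purpose is to
> # populate the shared ambient file; its picture output is thrown away
> rpict -x 64 -y 64 -ab 2 -aa .15 -ad 512 -af scene.amb \
>     -vf view.vf scene.oct > /dev/null
> # the full-size distributed runs on each node/core then reuse -af scene.amb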
>
> As an aside, perhaps Rob G. has some thoughts on Radiance/Clusters as I
> think they have a large one also. What is the cluster set up at LBNL? I
> believe that at one point they were using a provisioning system called
> Warewulf which has now evolved to Perceus. I have the former setup and
> have not gotten around to the latter. LBNL may also be using a job
> queuing system called Slurm which they developed (or maybe that was at
> LLNL)?
>
> Hopefully this is not leading you off on the wrong track though.
> Probably would be useful to figure out if the problem is indeed rpiece
> related or something else entirely.
>
> -Jack
> --
> # Jack de Valpine
> # president
> #
> # visarc incorporated
> # http://www.visarc.com
> #
> # channeling technology for superior design and construction
>
> On 4/11/2012 1:27 AM, Randolph M. Fritz wrote:
> Thanks Jack, Greg.
>
> Jack, what kernel were you using? Was it also Linux?
>
> Greg, I was using rad, so those delays are already in there, alas. I
> wonder if there is some subtle difference between the Mac OS Mach
> kernel and the Linux kernel that's causing the problem, or if it occurs
> on all platforms, just more frequently on the very fast cluster nodes.
>
> Or, it could be an NFS locking problem, bah.
>
> If I find time, maybe I can dig into it some more. Right now, I may
> just finesse it by running multiple *different* simulations on the same
> cluster node.
>
> Randolph
>
> On 2012-04-09 21:52:47 +0000, Greg Ward said:
>
>> If it is a startup issue as Jack suggests, you might try inserting a
>> few seconds of delay between the spawning of each new rpiece process
>> using "sleep 5" or similar. This allows time for the sync file to be
>> updated without contention between processes. This is what I do in rad
>> with the -N option. I actually wait 10 seconds between each new rpiece
>> process.
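>> On a single machine, the equivalent by hand looks roughly like this
>> (file names, grid divisions, and options below are only placeholders):
>>
>> #!/bin/csh
>> # start several rpiece processes on one sync file, pausing between
>> # spawns so each new process sees the sync file in a settled state
>> @ n = 1
>> while ($n <= 8)
>>     rpiece -F scene_rpsync.txt -X 4 -Y 4 -x 2048 -y 2048 -vf view.vf \
>>         -af scene.amb -o scene.unf scene.oct &
>>     sleep 10
>>     @ n++
>> end
>> wait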
>>
>> This isn't to say that I understand the source of your error, which
>> still puzzles me.
>>
>> -Greg
>> From: Jack de Valpine <jedev at visarc.com>
>> Date: April 9, 2012 1:46:03 PM PDT
>>
>> Hey Randolph,
>>
>> I have run into this before. Unfortunately I have had limited success
>> in tracking down the issue and also have not really looked at it for
>> some time. If I recall correctly, a couple of things that I have
>> noticed:
>> • possible problem if a piece finishes before the first set of pieces
>> is parcelled out by rpiece - so if 8 pieces are being distributed at
>> startup and piece 2 (for example) finishes before one of pieces 1, 3,
>> 4, 5, 6, 7, 8 has even been processed by rpiece, or while rpiece is
>> still forking off the initial jobs.
>> Sorry I cannot offer more; I have spent some time in the code on this
>> one, and it is not for the faint of heart, to say the least.
>> -Jack
>> --
>> # Jack de Valpine
>> # president
>> #
>> # visarc incorporated
>> # http://www.visarc.com
>> #
>> # channeling technology for superior design and construction
>>
>> On 4/9/2012 3:29 PM, Randolph M. Fritz wrote:
>> This problem is back for a sequel, and it would really help my work if
>> I could get it going.
>>
>> It's been a few months since I last asked about this. Has anyone else
>> experienced this in a Linux environment? Anyone have any ideas what to
>> do about it or how to debug it?
>>
>> /proc/version reports:
>> Linux version 2.6.18-274.18.1.el5
>> (mockbuild-t2f/um9L7dhDWr0U+X5jBOG/Ez6ZCGd0 at public.gmane.org)
>> (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Thu Feb 9
>> 12:45:44 EST 2012
>>
>> Randolph
>>
>> On 2011-07-08 01:13:01 +0000, Randolph M. Fritz said:
>>
>> On 2011-07-07 16:54:06 -0700, Greg Ward said:
>>
>> Hi Randolph,
>>
>> This shouldn't happen, unless one of the rpict processes died
>> unexpectedly. Even then, I would expect some other kind of error to be
>> reported as well.
>>
>> -Greg
>>
>> Thanks, Greg. I think that's what happened; in two of the cases, in
>> fact, seven of the eight died. Weirdly, the third case succeeded. If I
>> run it as a single-processor job, it works. Here's a piece of the log:
>>
>> rpiece -F bl_blinds_rpsync.txt -PP pfLF5M90 -vtv -vp 60.0 -2.0 66.0 -vd
>> 12.0 0.0 0.0 -vu 0 0 1 -vh 60 -x 1024 -y 1024 -dp 512 -ar 42 -ms 3.6
>> -ds .3 -dt .1 -dc .5 -dr 1 -ss 1 -st .1 -af bl.amb -aa .1 -ad 1536 -as
>> 392 -av 10 10 10 -lr 8 -lw 1e-4 -ps 6 -pt .08 -o bl_blinds.unf bl.oct
>>
>> rpict: warning - no output produced
>>
>> rpict: system - write error in io_process: Broken pipe
>> rpict: 0 rays, 0.00% after 0.000u 0.000s 0.001r hours on n0065.lr1
>> rad: error rendering view blinds
>>
>>
>
>
> --
> Randolph M. Fritz
>
--
Randolph M. Fritz