doc/notes/parallel.txt

         Parallel Rendering on the ICSD SPARC-10's

                         Greg Ward
              Energy and Environment Division


The Information and Computing  Services  Division  was  kind
enough  to  make  10 Sun SPARC-10's available on the network
for enterprising individuals who wished to  perform  experi-
ments  in  distributed  parallel  processing.   This article
describes the method we  developed  to  efficiently  run  an
incompletely  parallelizable  rendering program in a distri-
buted processing environment.

The lighting  simulation  and  rendering  software  we  have
developed over the past 8 years, Radiance, has only recently
been made to work in parallel environments.  Although paral-
lel ray tracing programs have been kicking around the graph-
ics community for several years, Radiance  uses  a  modified
ray  tracing  algorithm  that does not adapt as readily to a
parallel implementation.  The main difference is that  Radi-
ance  produces  illumination  information  that  is globally
reused during the rendering of  an  image.   Thus,  spawning
disjoint  processes  to  work  on disjoint parts of an image
will  not  result  in  the  linear  speedup  desired.   Each
independent  process  would  create its own set of "indirect
irradiance" values for its section of the image, and many of
these  values  would be redundant and would represent wasted
CPU time.  It is therefore essential that  this  information
be  shared  among  different  processes  working on the same
scene.  The question is, how to do it?

To minimize incompatibilities with different UNIX  implemen-
tations,  we decided early on in our parallel rendering work
to rely on the Network File System (NFS) only, imperfect  as
it  is.   The  chief  feature that enables us to do parallel
rendering is NFS file locking, which is  supported  by  most
current UNIX implementations.  File locking allows a process
on the same machine  or  a  different  machine  to  restrict
access to any section of an open file that resides either
locally or on an NFS-mounted filesystem.  Thus, data-sharing
is  handled  through  the  contents  of an ordinary file and
coordinated by the network lock manager.  This method can be
slow under high contention, so access frequency must be
kept low.
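
As a minimal sketch of locking a section of a file, the
following uses Python's fcntl.lockf, which issues POSIX
byte-range locks.  The helper name with_region_lock and its
parameters are hypothetical and are not taken from Radiance.
Whether such a lock is honored across machines depends on the
NFS lock manager being available on both client and server.

```python
import fcntl
import os


def with_region_lock(path, start, length, action):
    """Run action(fd) while holding an exclusive lock on
    bytes [start, start + length) of the file at path."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # Blocks until the lock is granted.  On an NFS mount the
        # request is passed on to the network lock manager.
        fcntl.lockf(fd, fcntl.LOCK_EX, length, start, os.SEEK_SET)
        try:
            return action(fd)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
    finally:
        os.close(fd)
```

The lock is advisory: it only restricts other processes that
request a lock on an overlapping region themselves.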

In this article, we will  refer  to  processes  rather  than
machines because the methods presented work both in cases of
multiple  processors  on  a  single  machine  and   multiple
machines distributed over a network.

The method we adopted for sharing  our  indirect  irradiance
values is simple.  Each process accumulates a small
number of values (on the order of 16 --  enough  to  fill  a
standard  UNIX buffer) before appending these to a file.  In
preparation for writing out its buffer, the  process  places
an  exclusive lock on the file, then checks to see if it has
grown since the last time.  If it has, the process reads  in
the  new information, assuming it has come from another pro-
cess that is legitimately working on  this  file.   Finally,
the  process flushes its buffer and releases the lock on the
file.  The file thus contains the cumulative indirect  irra-
diance  calculations of all the processes, and every process
has this information stored also in  memory  (up  until  the
last time it flushed its buffer).  Saving the information to
a file has the further advantage of providing  a  convenient
way to reuse the data for later renderings.
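
The lock, check, read, append and unlock sequence described
above might be sketched as follows.  This is an illustration
under stated assumptions, not the Radiance implementation: the
class name SharedCache, the four-double record layout and the
method names are all hypothetical.  Only the flush threshold
of about 16 values comes from the text.

```python
import fcntl
import os
import struct

# Hypothetical fixed-size record; the real file format is not shown here.
RECORD = struct.Struct("<4d")
FLUSH_AT = 16   # flush after about 16 values, as described in the text


class SharedCache:
    """Append-only value file shared by cooperating processes."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.known = []     # every value seen so far, ours and others'
        self.pending = []   # our values not yet written to the file
        self.synced = 0     # file offset up to which we have read

    def add(self, value):
        """Record a new 4-tuple of floats; flush when the buffer is full."""
        self.known.append(value)
        self.pending.append(value)
        if len(self.pending) >= FLUSH_AT:
            self.sync()

    def sync(self):
        """Pick up values added by others, then append our own."""
        fcntl.lockf(self.fd, fcntl.LOCK_EX)      # exclusive lock, whole file
        try:
            size = os.fstat(self.fd).st_size
            if size > self.synced:               # the file has grown
                data = os.pread(self.fd, size - self.synced, self.synced)
                self.known.extend(RECORD.iter_unpack(data))
            out = b"".join(RECORD.pack(*v) for v in self.pending)
            os.pwrite(self.fd, out, size)        # append our buffer
            self.synced = size + len(out)
            self.pending.clear()
        finally:
            fcntl.lockf(self.fd, fcntl.LOCK_UN)
```

Because a process only reads the file when it is about to
write, the lock is requested once per 16 or so values rather
than once per value, which keeps contention low.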

The image to be rendered is divided into many small  pieces,
more  pieces  than  there  are processors.  This way, if one
piece takes longer than the others, the processors that  had
easy  pieces  are not all waiting for the processor with the
difficult piece to finish.  Coordination  between  processes
is  again  handled by the network lock manager.  A file con-
tains the position of the last piece being worked on, and as
soon  as  a processor finishes its piece, it locks the file,
finds out what to work on next, increments the position  and
unlocks the file again.  Thus, there is no need for a single
controlling process, and rendering  processes  may  be  ini-
tiated and terminated at will.
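
A minimal sketch of this piece assignment scheme follows.  It
assumes the shared file holds a single decimal number, the
index of the next unassigned piece; the actual file contents
used by Radiance may differ, and the function name next_piece
is hypothetical.

```python
import fcntl
import os


def next_piece(path, total):
    """Claim the next unassigned piece index from the shared file,
    or return None once all `total` pieces have been handed out."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)        # wait for exclusive access
        try:
            text = os.pread(fd, 32, 0).decode().strip()
            piece = int(text) if text else 0  # empty file: nothing assigned
            if piece >= total:
                return None
            os.ftruncate(fd, 0)
            os.pwrite(fd, str(piece + 1).encode(), 0)   # advance position
            return piece
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Each rendering process would call next_piece in a loop and
stop when it returns None.  Since all state lives in the file,
a process can join or leave at any time without informing the
others.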

ICSD's offer to use their farm of SPARC-10's  was  an  ideal
opportunity to test our programs under real conditions.  The
problem at hand was producing  numerically  accurate,  high-
resolution renderings of the lower deck of a ship under dif-
ferent lighting conditions.  Three images were rendered  one
at  a  time,  with  all 10 SPARC-10 machines working on each
image simultaneously.  The wall time required to render  one
image  was about 4.3 hours.  The first machine finished with
all it could do shortly  after  the  last  image  piece  was
assigned  at 2.8 hours.  Thus, many of the processors in our
test run were done before the  entire  image  was  complete.
This  is  a  problem  of  not  breaking the image into small
enough pieces for efficient processor allocation.

For the time that the processors were running, all  but  one
had  98%  or 99% CPU utilization.  The one exception was the
file server, which had 94% CPU utilization.  This means that
the processors were well saturated while working on our job,
not waiting for image piece assignments, disk  access,  etc.

If  we  include the time at the end when some processors had
finished while others were still going,  the  effective  CPU
utilization averaged 84%, with the lowest at 75%.  Again,
this figure is low because the 49 pieces we specified were
too few; the picture should have been divided more finely.
(The overall utilization was really better than this,  since
we  set  the  jobs  up to run one after the other and once a
processor finished its part on one image it went on to  work
on the next image.)

The real proof of a parallel implementation is not CPU util-
ization, however; it is the speedup factor.  To examine
this, it was necessary to start the job over, running  on  a
single processor.  Running alone, one SPARC-10 took about 35
hours to finish an image, with 99% CPU utilization.  That is
about 8.2 times as long as the 4.3 hours required by 10
processors, a speedup of 8.2 where 10 would be ideal; the
shortfall is due mostly to idle processors at the end.  This
ratio, 8.2/10, is very close to the average effective CPU
utilization of 84%, indicating that parallel processing
does not result in much redundant calculation.
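
The arithmetic can be checked from the figures quoted above.
The variable names are ours.  Because the quoted times are
rounded ("about 35 hours", "about 4.3 hours"), the computed
speedup comes out near 8.1 rather than exactly 8.2.

```python
serial_hours = 35.0     # one SPARC-10 running alone (from the text)
parallel_hours = 4.3    # wall time with 10 machines (from the text)
processors = 10

# Speedup: how many times faster the parallel run finished.
speedup = serial_hours / parallel_hours

# Parallel efficiency: the fraction of ideal linear speedup achieved.
efficiency = speedup / processors

print(f"speedup {speedup:.1f}, efficiency {efficiency:.0%}")
```

An efficiency close to the measured 84% effective CPU
utilization suggests that nearly all of the lost time is idle
time, not duplicated work.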

Our experience showed that  an  incompletely  parallelizable
problem could be solved efficiently on distributed proces-
sors using NFS as a data sharing mechanism.  The principal
lesson  we  learned from this exercise is that good utiliza-
tion of multiple processors requires that the job be  broken
into  small  enough  chunks.  It is perhaps significant that
the time spent idle, 16%, corresponds roughly to the percen-
tage of the total time required by a processor to finish one
piece (since there were about 5 chunks for each  processor).
If  we  were to decrease the size of the pieces so that each
processor got 20 pieces on average,  we  should  expect  the
idle time to go down to around 5%.
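
The rule of thumb used here, that the idle fraction is roughly
the share of a processor's total time taken by one piece, can
be written out as below.  The function name is hypothetical,
and the estimate assumes pieces of similar cost; it is a rough
guide, not a measured result.

```python
def idle_estimate(pieces, processors):
    """Rough idle fraction: one piece as a share of the
    work each processor handles on average."""
    pieces_per_processor = pieces / processors
    return 1.0 / pieces_per_processor


# 49 pieces on 10 processors, as in the test run.
run_estimate = idle_estimate(49, 10)

# 20 pieces per processor, as proposed in the text.
finer_estimate = idle_estimate(200, 10)
```

For the test run this gives about 20%, in rough agreement with
the 16% observed, and 5% for the finer division.
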
Revision:	1.1
Committed:	Sat Mar 15 17:32:55 2003 UTC by greg
Content type:	text/plain
Branch:	MAIN
CVS Tags:	rad5R4, rad5R2, rad4R2P2, rad5R0, rad5R1, rad3R7P2, rad3R7P1, rad4R2, rad4R1, rad4R0, rad3R5, rad3R6, rad3R6P1, rad3R8, rad3R9, rad4R2P1, rad5R3, HEAD
Log Message:	Added and updated documentation for 3.5 release