Parallel Rendering on the ICSD SPARC-10's

Greg Ward
Energy and Environment Division

The Information and Computing Services Division was kind enough to
make 10 Sun SPARC-10's available on the network for enterprising
individuals who wished to perform experiments in distributed parallel
processing.  This article describes the method we developed to
efficiently run an incompletely parallelizable rendering program in a
distributed processing environment.

The lighting simulation and rendering software we have developed over
the past 8 years, Radiance, has only recently been made to work in
parallel environments.  Although parallel ray tracing programs have
been kicking around the graphics community for several years, Radiance
uses a modified ray tracing algorithm that does not adapt as readily
to a parallel implementation.  The main difference is that Radiance
produces illumination information that is globally reused during the
rendering of an image.  Thus, spawning disjoint processes to work on
disjoint parts of an image will not result in the linear speedup
desired.  Each independent process would create its own set of
"indirect irradiance" values for its section of the image, and many of
these values would be redundant and would represent wasted CPU time.
It is therefore essential that this information be shared among
different processes working on the same scene.  The question is, how
to do it?

To minimize incompatibilities with different UNIX implementations, we
decided early on in our parallel rendering work to rely on the Network
File System (NFS) only, imperfect as it is.  The chief feature that
enables us to do parallel rendering is NFS file locking, which is
supported by most current UNIX implementations.  File locking allows a
process on the same machine or a different machine to restrict access
to any section of an open file that resides either locally or on an
NFS-mounted filesystem.  Thus, data sharing is handled through the
contents of an ordinary file and coordinated by the network lock
manager.  This method can be slow under high contention, so access
frequency must be kept low.
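
To make the mechanism concrete, the locking involved looks roughly
like the following C fragment, which uses the standard fcntl(2)
record-locking interface.  This is only a sketch; the helper name
lock_whole_file() is ours, not part of Radiance.

    /* Sketch of advisory record locking via fcntl(2), the interface that
     * the network lock manager extends across NFS.  Illustrative only;
     * the helper name lock_whole_file() is ours, not from Radiance. */
    #include <fcntl.h>
    #include <unistd.h>

    int
    lock_whole_file(int fd, short type)    /* F_WRLCK, F_RDLCK or F_UNLCK */
    {
        struct flock  fl;

        fl.l_type = type;
        fl.l_whence = SEEK_SET;            /* offsets relative to file start */
        fl.l_start = 0;
        fl.l_len = 0;                      /* zero length locks to end of file */
        return fcntl(fd, F_SETLKW, &fl);   /* block until the lock is granted */
    }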

In this article, we will refer to processes rather than machines,
because the methods presented work both with multiple processors on a
single machine and with multiple machines distributed over a network.

The method we adopted for sharing our indirect irradiance values is
simple.  Each process caches a small number of values (on the order of
16 -- enough to fill a standard UNIX buffer) before appending them to
a file.  In preparation for writing out its buffer, the process places
an exclusive lock on the file, then checks to see if the file has
grown since the last time it looked.  If it has, the process reads in
the new information, assuming it has come from another process that is
legitimately working on this file.  Finally, the process flushes its
buffer and releases the lock on the file.  The file thus contains the
cumulative indirect irradiance calculations of all the processes, and
every process also holds this information in memory (as of the last
time it flushed its buffer).  Saving the information to a file has the
further advantage of providing a convenient way to reuse the data for
later renderings.
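
In outline, the flush step described above might look like the
following C fragment.  The record layout and variable names are
assumptions for illustration (the actual Radiance structures differ),
and it reuses the lock_whole_file() helper from the sketch above.

    /* Sketch of the buffer-flush step described above, reusing the
     * lock_whole_file() helper from the earlier fragment.  The record
     * layout and names are assumptions, not the Radiance identifiers. */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define NSHARE  16                     /* values cached before each flush */

    typedef struct {
        float  pos[3], val[3];             /* assumed record layout */
    } IRRAD;

    int lock_whole_file(int, short);       /* from the earlier sketch */

    static IRRAD  cache[NSHARE];           /* values computed since last flush */
    static int    ncached = 0;             /* number of values in cache */
    static off_t  lastpos = 0;             /* file length at our last flush */

    void
    flush_shared(int fd)                   /* fd: shared value file, open O_RDWR */
    {
        off_t  flen;

        lock_whole_file(fd, F_WRLCK);            /* exclusive access */
        flen = lseek(fd, (off_t)0, SEEK_END);    /* has the file grown? */
        if (flen > lastpos) {                    /* yes: another process wrote */
            lseek(fd, lastpos, SEEK_SET);
            /* ... read and incorporate the (flen - lastpos) new bytes ... */
            lseek(fd, (off_t)0, SEEK_END);
        }
        write(fd, cache, ncached*sizeof(IRRAD)); /* append our cached values */
        lastpos = flen + ncached*sizeof(IRRAD);
        ncached = 0;
        lock_whole_file(fd, F_UNLCK);            /* release the lock */
    }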

The image to be rendered is divided into many small pieces, more
pieces than there are processors.  This way, if one piece takes longer
than the others, the processors that had easy pieces are not all
waiting for the processor with the difficult piece to finish.
Coordination between processes is again handled by the network lock
manager.  A file contains the position of the last piece being worked
on, and as soon as a processor finishes its piece, it locks the file,
finds out what to work on next, increments the position, and unlocks
the file again.  Thus, there is no need for a single controlling
process, and rendering processes may be initiated and terminated at
will.
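
The piece-assignment step admits a similarly small sketch.  Here the
synchronization file simply holds the index of the last piece handed
out; again, the names and file format are illustrative assumptions
rather than the actual Radiance code.

    /* Sketch of the piece-assignment step: a small shared file holds the
     * index of the last piece handed out.  The file format and names are
     * illustrative assumptions, not the actual Radiance mechanism. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int lock_whole_file(int, short);       /* from the earlier sketch */

    int
    next_piece(char *syncfile)             /* returns index of next piece to do */
    {
        char  buf[32];
        int   fd, n, piece;

        if ((fd = open(syncfile, O_RDWR)) < 0)
            return(-1);
        lock_whole_file(fd, F_WRLCK);      /* exclusive access to the counter */
        n = read(fd, buf, sizeof(buf)-1);
        buf[n > 0 ? n : 0] = '\0';
        piece = atoi(buf) + 1;             /* claim the next piece */
        sprintf(buf, "%d\n", piece);
        lseek(fd, (off_t)0, SEEK_SET);
        write(fd, buf, strlen(buf));       /* record our claim for the others */
        lock_whole_file(fd, F_UNLCK);
        close(fd);
        return(piece);                     /* caller quits when piece > total */
    }

The counter file is assumed to start out containing 0, and each
rendering process simply quits once the index it draws exceeds the
number of pieces, so processes can come and go without any central
coordinator.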

ICSD's offer to use their farm of SPARC-10's was an ideal opportunity
to test our programs under real conditions.  The problem at hand was
producing numerically accurate, high-resolution renderings of the
lower deck of a ship under different lighting conditions.  Three
images were rendered one at a time, with all 10 SPARC-10 machines
working on each image simultaneously.  The wall time required to
render one image was about 4.3 hours.  The first machine finished all
it could do shortly after the last image piece was assigned, at 2.8
hours.  Thus, many of the processors in our test run were done before
the entire image was complete.  This happened because we did not break
the image into small enough pieces for efficient processor allocation.

For the time that the processors were running, all but one had 98% or
99% CPU utilization.  The one exception was the file server, which had
94% CPU utilization.  This means that the processors were well
saturated while working on our job, not waiting for image piece
assignments, disk access, etc.

If we include the time at the end when some processors had finished
while others were still going, the effective CPU utilization averaged
84%, with the lowest at 75%.  Again, this low figure was due to the
fact that the picture should have been divided into more than the 49
pieces we specified.  (The overall utilization was really better than
this, since we set the jobs up to run one after the other, and once a
processor finished its part on one image it went on to work on the
next image.)

The real proof of a parallel implementation is not CPU utilization,
however, but the speedup factor.  To examine this, it was necessary to
start the job over, running on a single processor.  Running alone, one
SPARC-10 took about 35 hours to finish an image, with 99% CPU
utilization.  That is about 8.2 times as long as the total time
required by 10 processors to finish the image (the difference being
due mostly to idle processors at the end).  This ratio, 8.2/10, is
very close to the average effective CPU utilization of 84%, indicating
that parallel processing does not result in a lot of redundant
calculation.

Our experience showed that an incompletely parallelizable problem
could be solved efficiently on distributed processors using NFS as a
data-sharing mechanism.  The principal lesson we learned from this
exercise is that good utilization of multiple processors requires that
the job be broken into small enough chunks.  It is perhaps significant
that the time spent idle, 16%, corresponds roughly to the percentage
of the total time required by a processor to finish one piece (since
there were about 5 chunks for each processor).  If we were to decrease
the size of the pieces so that each processor got 20 pieces on
average, we should expect the idle time to go down to around 5%.
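
A rough model makes this expectation explicit (our own simplification,
assuming the pieces are of comparable cost): a processor can sit idle
for at most about one piece's worth of time at the end of a run, so

    \[
        \mbox{idle fraction} \;\approx\;
        \frac{t_{\mathrm{piece}}}{t_{\mathrm{total}}}
        \;\approx\; \frac{1}{N},
    \]

where N is the average number of pieces per processor.  With N of
about 5 (49 pieces on 10 machines) this gives roughly 20%, consistent
with the 16% we measured; with N of 20 it gives the 5% quoted above.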