Parallel Rendering on the ICSD SPARC-10's

Greg Ward
Energy and Environment Division

The Information and Computing Services Division was kind enough to make 10 Sun SPARC-10's available on the network for enterprising individuals who wished to perform experiments in distributed parallel processing.  This article describes the method we developed to efficiently run an incompletely parallelizable rendering program in a distributed processing environment.

The lighting simulation and rendering software we have developed over the past 8 years, Radiance, has only recently been made to work in parallel environments.  Although parallel ray tracing programs have been kicking around the graphics community for several years, Radiance uses a modified ray tracing algorithm that does not adapt as readily to a parallel implementation.  The main difference is that Radiance produces illumination information that is globally reused during the rendering of an image.  Thus, spawning disjoint processes to work on disjoint parts of an image will not result in the linear speedup desired.  Each independent process would create its own set of "indirect irradiance" values for its section of the image, and many of these values would be redundant and would represent wasted CPU time.  It is therefore essential that this information be shared among different processes working on the same scene.  The question is, how to do it?

To minimize incompatibilities among different UNIX implementations, we decided early on in our parallel rendering work to rely only on the Network File System (NFS), imperfect as it is.  The chief feature that enables us to do parallel rendering is NFS file locking, which is supported by most current UNIX implementations.  File locking allows a process on the same machine or a different machine to restrict access to any section of an open file that resides either locally or on an NFS-mounted filesystem.  Thus, data sharing is handled through the contents of an ordinary file and coordinated by the network lock manager.  This method can be slow under high contention, so access frequency must be kept low.
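
This kind of whole-file lock can be obtained through the POSIX fcntl() record-locking interface; on an NFS-mounted file the request is serviced by the network lock manager.  The following is a minimal illustrative sketch (the function names are ours, and this is not an excerpt from the Radiance source):

    /*
     * Illustrative only -- not the Radiance source.  Lock or unlock an
     * entire open file using POSIX fcntl() record locks; for a file on
     * an NFS filesystem, the request goes to the network lock manager.
     */
    #include <fcntl.h>
    #include <unistd.h>

    int
    lock_file(int fd)               /* block until we hold an exclusive lock */
    {
        struct flock  fl;

        fl.l_type = F_WRLCK;        /* exclusive (write) lock */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;               /* zero length means the whole file */
        return fcntl(fd, F_SETLKW, &fl);
    }

    int
    unlock_file(int fd)             /* release our lock */
    {
        struct flock  fl;

        fl.l_type = F_UNLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        return fcntl(fd, F_SETLK, &fl);
    }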

In this article, we will refer to processes rather than machines, because the methods presented work both for multiple processors on a single machine and for multiple machines distributed over a network.

The method we adopted for sharing our indirect irradiance values is simple.  Each process caches a small number of values (on the order of 16 -- enough to fill a standard UNIX buffer) before appending them to a file.  In preparation for writing out its buffer, the process places an exclusive lock on the file, then checks to see whether the file has grown since it last looked.  If it has, the process reads in the new information, assuming it has come from another process that is legitimately working on this file.  Finally, the process flushes its buffer and releases the lock on the file.  The file thus contains the cumulative indirect irradiance calculations of all the processes, and every process also holds this information in memory (current as of the last time it flushed its buffer).  Saving the information to a file has the further advantage of providing a convenient way to reuse the data for later renderings.
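
As a concrete sketch of that flush cycle (the 64-byte record size, the cache size, and the helper routines use_record(), lock_file(), and unlock_file() are assumptions for illustration, not the actual Radiance data format):

    /*
     * Illustrative flush cycle: lock the shared file, read any records
     * appended by other processes since our last visit, append our own
     * cached records, and release the lock.
     */
    #include <sys/types.h>
    #include <unistd.h>

    #define RECSIZE  64                    /* assumed size of one record */
    #define NCACHE   16                    /* records buffered per flush */

    extern int   lock_file(int fd), unlock_file(int fd);
    extern void  use_record(char *rec);    /* absorb a value into memory */

    static char   cache[NCACHE][RECSIZE];  /* our unwritten values */
    static int    ncached;                 /* number of cached records */
    static off_t  lastpos;                 /* file length we saw last time */

    void
    flush_cache(int fd)
    {
        char  rec[RECSIZE];
        int   i;

        lock_file(fd);
        lseek(fd, lastpos, SEEK_SET);      /* has the file grown? */
        while (read(fd, rec, RECSIZE) == RECSIZE)
            use_record(rec);               /* new values from other processes */
        for (i = 0; i < ncached; i++)      /* append our own buffer */
            write(fd, cache[i], RECSIZE);
        lastpos = lseek(fd, 0, SEEK_CUR);  /* everything up to here is known */
        ncached = 0;
        unlock_file(fd);
    }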

The image to be rendered is divided into many small pieces, more pieces than there are processors.  This way, if one piece takes longer than the others, the processors that had easy pieces are not all waiting for the processor with the difficult piece to finish.  Coordination between processes is again handled by the network lock manager.  A file contains the position of the last piece being worked on, and as soon as a processor finishes its piece, it locks the file, finds out what to work on next, increments the position and unlocks the file again.  Thus, there is no need for a single controlling process, and rendering processes may be initiated and terminated at will.
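
A sketch of that assignment step, again with hypothetical names and a plain decimal counter as the assumed file format:

    /*
     * Illustrative piece assignment: the shared file holds the index of
     * the next piece to render as a decimal string.  Returns the piece
     * to work on, or -1 once all pieces have been claimed.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    extern int  lock_file(int fd), unlock_file(int fd);

    int
    next_piece(int fd, int npieces)
    {
        char  buf[32];
        int   n, piece = 0;

        lock_file(fd);
        lseek(fd, 0, SEEK_SET);
        if ((n = read(fd, buf, sizeof(buf)-1)) > 0) {
            buf[n] = '\0';
            piece = atoi(buf);              /* next unclaimed position */
        }
        if (piece < npieces) {              /* claim it and advance */
            n = sprintf(buf, "%d\n", piece+1);
            lseek(fd, 0, SEEK_SET);
            write(fd, buf, n);
        }
        unlock_file(fd);
        return(piece < npieces ? piece : -1);
    }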

ICSD's offer to use their farm of SPARC-10's was an ideal opportunity to test our programs under real conditions.  The problem at hand was producing numerically accurate, high-resolution renderings of the lower deck of a ship under different lighting conditions.  Three images were rendered one at a time, with all 10 SPARC-10 machines working on each image simultaneously.  The wall time required to render one image was about 4.3 hours.  The first machine finished all it could do shortly after the last image piece was assigned, at 2.8 hours.  Thus, many of the processors in our test run were done before the entire image was complete.  This was a consequence of not breaking the image into small enough pieces for efficient processor allocation.
| 96 |
For the time that the processors were running, all but one |
| 97 |
had 98% or 99% CPU utilization. The one exception was the |
| 98 |
file server, which had 94% CPU utilization. This means that |
| 99 |
the processors were well saturated while working on our job, |
| 100 |
not waiting for image piece assignments, disk access, etc. |
| 101 |
|
| 102 |
If we include the time at the end when some processors had |
| 103 |
finished while others were still going, the effective CPU |
| 104 |
utilization averaged 84%, with the lowest at 75%. Again, |
| 105 |
this low figure was due to the fact that the picture should |
| 106 |
have been divided into more than the 49 pieces we specified. |
| 107 |
(The overall utilization was really better than this, since |
| 108 |
we set the jobs up to run one after the other and once a |
| 109 |
processor finished its part on one image it went on to work |
| 110 |
on the next image.) |

The real proof of a parallel implementation is not CPU utilization, however, but the speedup factor.  To examine this, it was necessary to start the job over, running on a single processor.  Running alone, one SPARC-10 took about 35 hours to finish an image, with 99% CPU utilization.  That is about 8.2 times as long as the total time required by 10 processors to finish the image (a gap due mostly to the idle processors at the end).  This ratio, 8.2/10, is very close to the average effective CPU utilization of 84%, indicating that parallel processing does not result in a lot of redundant calculation.
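
Written out (restating the rounded figures above, with T_1 and T_10 the one- and ten-processor wall times and N = 10 processors):

    S = \frac{T_1}{T_{10}} \approx \frac{35\ \mathrm{hours}}{4.3\ \mathrm{hours}} \approx 8.2,
    \qquad
    E = \frac{S}{N} = \frac{8.2}{10} \approx 82\%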

Our experience showed that an incompletely parallelizable problem could be solved efficiently on distributed processors using NFS as a data-sharing mechanism.  The principal lesson we learned from this exercise is that good utilization of multiple processors requires that the job be broken into small enough chunks.  It is perhaps significant that the time spent idle, 16%, corresponds roughly to the percentage of the total time required by a processor to finish one piece (since there were about 5 chunks for each processor).  If we were to decrease the size of the pieces so that each processor got 20 pieces on average, we would expect the idle time to go down to around 5%.
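
As a rough rule of thumb (a back-of-the-envelope model rather than a measurement), the idle fraction at the end of a run is on the order of the reciprocal of the number of pieces k handled by each processor:

    \text{idle fraction} \sim \frac{1}{k}, \qquad
    k \approx \frac{49}{10} \approx 5 \;\Rightarrow\; \frac{1}{5} = 20\%, \qquad
    k = 20 \;\Rightarrow\; \frac{1}{20} = 5\%

which agrees reasonably well with the 16% we observed and with the 5% we would expect from the smaller pieces.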