Parallel Rendering on the ICSD SPARC-10's

Greg Ward
Energy and Environment Division

The Information and Computing Services Division was kind enough to
make 10 Sun SPARC-10's available on the network for enterprising
individuals who wished to perform experiments in distributed parallel
processing. This article describes the method we developed to
efficiently run an incompletely parallelizable rendering program in a
distributed processing environment.

The lighting simulation and rendering software we have developed over
the past 8 years, Radiance, has only recently been made to work in
parallel environments. Although parallel ray tracing programs have
been kicking around the graphics community for several years, Radiance
uses a modified ray tracing algorithm that does not adapt as readily
to a parallel implementation. The main difference is that Radiance
produces illumination information that is globally reused during the
rendering of an image. Thus, spawning disjoint processes to work on
disjoint parts of an image will not result in the linear speedup
desired. Each independent process would create its own set of
"indirect irradiance" values for its section of the image, and many of
these values would be redundant and would represent wasted CPU time.
It is therefore essential that this information be shared among
different processes working on the same scene. The question is, how
to do it?

To minimize incompatibilities between different UNIX implementations,
we decided early on in our parallel rendering work to rely on the
Network File System (NFS) only, imperfect as it is. The chief feature
that enables us to do parallel rendering is NFS file locking, which is
supported by most current UNIX implementations. File locking allows a
process on the same machine or a different machine to restrict access
to any section of an open file that resides either locally or on an
NFS-mounted filesystem. Thus, data sharing is handled through the
contents of an ordinary file and coordinated by the network lock
manager. This method can be slow under high contention, so access
frequency must be kept low.

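The locking itself uses the standard fcntl(2) interface, which the
network lock manager honors across NFS mounts. The fragment below is
a minimal sketch of the idiom rather than code from Radiance; the
file name is invented and error handling is kept to a bare minimum.

    /*
     * Minimal sketch of advisory locking with fcntl(2) -- illustrative
     * only, not taken from Radiance.  "shared.dat" is a made-up name.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct flock  lk;
        int  fd;

        if ((fd = open("shared.dat", O_RDWR|O_CREAT, 0664)) < 0) {
            perror("shared.dat");
            exit(1);
        }
        lk.l_type = F_WRLCK;            /* exclusive (write) lock */
        lk.l_whence = SEEK_SET;
        lk.l_start = 0;
        lk.l_len = 0;                   /* zero length means "to end of file" */
        if (fcntl(fd, F_SETLKW, &lk) < 0) {     /* wait until granted */
            perror("lock");
            exit(1);
        }
        /* ... read and/or append the shared data here ... */
        lk.l_type = F_UNLCK;            /* release the lock */
        fcntl(fd, F_SETLKW, &lk);
        close(fd);
        return 0;
    }
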
In this article, we will refer to processes rather than machines,
because the methods presented apply equally to multiple processors on
a single machine and to multiple machines distributed over a network.

The method we adopted for sharing our indirect irradiance values is
simple. Each process caches a small number of values (on the order of
16 -- enough to fill a standard UNIX buffer) before appending them to
a file. In preparation for writing out its buffer, the process places
an exclusive lock on the file, then checks to see whether the file has
grown since the last time it looked. If it has, the process reads in
the new information, assuming it has come from another process that is
legitimately working on this file. Finally, the process flushes its
buffer and releases the lock on the file. The file thus contains the
cumulative indirect irradiance calculations of all the processes, and
every process also holds this information in memory (current as of the
last time it flushed its buffer). Saving the information to a file has
the further advantage of providing a convenient way to reuse the data
for later renderings.

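In outline, one flush cycle might look like the sketch below. This is
not the Radiance source; the record size, the buffer length and the
use_value() routine are stand-ins invented for illustration.

    /*
     * Sketch of the lock / check-for-growth / read / append / unlock
     * cycle described above.  Callers add records to buf[] as they are
     * computed and call flushbuf() once nbuf reaches NBUFREC.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define RECSIZE  32             /* assumed size of one irradiance record */
    #define NBUFREC  16             /* records cached between flushes */

    static char   buf[NBUFREC*RECSIZE];  /* locally cached new records */
    static int    nbuf = 0;              /* number of records in the cache */
    static off_t  lastpos = 0;           /* file length at our last flush */

    extern void   use_value(char *rec);  /* hypothetical: absorb a shared record */

    void
    flushbuf(int fd)                /* write our cache, pick up others' work */
    {
        struct flock  lk;
        off_t  flen;
        char   rec[RECSIZE];

        lk.l_type = F_WRLCK; lk.l_whence = SEEK_SET;
        lk.l_start = 0; lk.l_len = 0;
        fcntl(fd, F_SETLKW, &lk);           /* exclusive lock; wait for it */

        flen = lseek(fd, (off_t)0, SEEK_END);   /* has the file grown? */
        if (flen > lastpos) {                   /* yes -- read the new records */
            lseek(fd, lastpos, SEEK_SET);
            while (lastpos + RECSIZE <= flen &&
                    read(fd, rec, RECSIZE) == RECSIZE) {
                use_value(rec);
                lastpos += RECSIZE;
            }
            lseek(fd, (off_t)0, SEEK_END);
        }
        write(fd, buf, nbuf*RECSIZE);           /* append our cached records */
        lastpos = lseek(fd, (off_t)0, SEEK_CUR);    /* remember new end */
        nbuf = 0;

        lk.l_type = F_UNLCK;                    /* release the lock */
        fcntl(fd, F_SETLKW, &lk);
    }

Because each process flushes only after accumulating a buffer of new
values, the lock manager is consulted relatively rarely, which keeps
contention low.
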
The image to be rendered is divided into many small pieces, more
pieces than there are processors. This way, if one piece takes longer
than the others, the processors that had easy pieces are not all
waiting for the processor with the difficult piece to finish.
Coordination between processes is again handled by the network lock
manager. A file contains the position of the last piece being worked
on, and as soon as a processor finishes its piece, it locks the file,
finds out what to work on next, increments the position, and unlocks
the file again. Thus, there is no need for a single controlling
process, and rendering processes may be initiated and terminated at
will.

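The piece-assignment step reduces to a few lines around the same
locking idiom. Again, the sketch below is illustrative rather than
actual Radiance code; it assumes the sync file holds the index of the
last piece assigned, written as ASCII text.

    /*
     * Sketch of claiming the next image piece from a shared sync file.
     * The caller opens the file once with O_RDWR|O_CREAT and passes in
     * the descriptor along with the total number of pieces.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    nextpiece(int fd, int npieces)      /* return next piece, or -1 if done */
    {
        struct flock  lk;
        char   cbuf[32];
        int    n, piece;

        lk.l_type = F_WRLCK; lk.l_whence = SEEK_SET;
        lk.l_start = 0; lk.l_len = 0;
        fcntl(fd, F_SETLKW, &lk);               /* lock the sync file */

        lseek(fd, (off_t)0, SEEK_SET);
        n = read(fd, cbuf, sizeof(cbuf)-1);     /* last piece assigned */
        cbuf[n < 0 ? 0 : n] = '\0';
        piece = atoi(cbuf) + 1;                 /* claim the next one */

        if (piece <= npieces) {                 /* record our claim */
            sprintf(cbuf, "%d\n", piece);
            lseek(fd, (off_t)0, SEEK_SET);
            write(fd, cbuf, strlen(cbuf));
        }
        lk.l_type = F_UNLCK;                    /* let the others at it */
        fcntl(fd, F_SETLKW, &lk);

        return piece <= npieces ? piece : -1;
    }

Each rendering process simply calls such a routine whenever it
finishes a piece, and exits (or moves on to the next image) once no
pieces remain.
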
ICSD's offer to use their farm of SPARC-10's was an ideal opportunity
to test our programs under real conditions. The problem at hand was
producing numerically accurate, high-resolution renderings of the
lower deck of a ship under different lighting conditions. Three images
were rendered one at a time, with all 10 SPARC-10 machines working on
each image simultaneously. The wall time required to render one image
was about 4.3 hours. The first machine finished all it could do
shortly after the last image piece was assigned, at 2.8 hours. Thus,
many of the processors in our test run were done before the entire
image was complete. This is a symptom of not breaking the image into
small enough pieces for efficient processor allocation.

For the time that the processors were running, all but one had 98% or
99% CPU utilization. The one exception was the file server, which had
94% CPU utilization. This means that the processors were well
saturated while working on our job, not waiting for image piece
assignments, disk access, etc.

If we include the time at the end when some processors had finished
while others were still going, the effective CPU utilization averaged
84%, with the lowest at 75%. Again, this low figure was due to the
fact that the picture should have been divided into more than the 49
pieces we specified. (The overall utilization was really better than
this, since we set the jobs up to run one after the other, and once a
processor finished its part of one image it went on to work on the
next image.)

The real proof of a parallel implementation is not CPU utilization,
however, but the speedup factor. To examine this, it was necessary to
start the job over, running on a single processor. Running alone, one
SPARC-10 took about 35 hours to finish an image, with 99% CPU
utilization. That is about 8.2 times as long as the total time
required by 10 processors to finish the image (the shortfall from a
factor of 10 being due mostly to idle processors at the end). This
ratio, 8.2/10, is very close to the average effective CPU utilization
of 84%, indicating that parallel processing does not result in a lot
of redundant calculation.

Our experience showed that an incompletely parallelizable problem
could be solved efficiently on distributed processors using NFS as a
data-sharing mechanism. The principal lesson we learned from this
exercise is that good utilization of multiple processors requires that
the job be broken into small enough chunks. It is perhaps significant
that the time spent idle, 16%, corresponds roughly to the fraction of
the total time a processor needs to finish one piece (since there were
about 5 chunks for each processor). If we were to decrease the size of
the pieces so that each processor got 20 pieces on average, we should
expect the idle time to go down to around 5%.