Parallel Rendering on the ICSD SPARC-10's

Greg Ward
Energy and Environment Division


The Information and Computing Services Division was kind
enough to make 10 Sun SPARC-10's available on the network
for enterprising individuals who wished to perform
experiments in distributed parallel processing. This
article describes the method we developed to efficiently
run an incompletely parallelizable rendering program in a
distributed processing environment.

The lighting simulation and rendering software we have
developed over the past 8 years, Radiance, has only
recently been made to work in parallel environments.
Although parallel ray tracing programs have been kicking
around the graphics community for several years, Radiance
uses a modified ray tracing algorithm that does not adapt
as readily to a parallel implementation. The main
difference is that Radiance produces illumination
information that is globally reused during the rendering
of an image. Thus, spawning disjoint processes to work on
disjoint parts of an image will not result in the linear
speedup desired. Each independent process would create its
own set of "indirect irradiance" values for its section of
the image, and many of these values would be redundant and
would represent wasted CPU time. It is therefore essential
that this information be shared among different processes
working on the same scene. The question is, how to do it?

To minimize incompatibilities with different UNIX
implementations, we decided early on in our parallel
rendering work to rely on the Network File System (NFS)
only, imperfect as it is. The chief feature that enables
us to do parallel rendering is NFS file locking, which is
supported by most current UNIX implementations. File
locking allows a process on the same machine or a
different machine to restrict access to any section of an
open file that resides either locally or on an NFS-mounted
filesystem. Thus, data sharing is handled through the
contents of an ordinary file and coordinated by the
network lock manager. This method can be slow under high
contention, so access frequency must be kept low.

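Record locks of this kind are obtained through the fcntl()
interface on most current UNIX systems, and it is the
network lock manager that extends them to NFS-mounted
files. As a rough sketch (not code taken from the Radiance
source), a whole file can be locked and later released
with a helper like the following; the name lock_file is
invented for illustration:

    #include <fcntl.h>
    #include <unistd.h>

    /* Wait for, or release, a lock covering an entire file.
     * ltype is F_WRLCK for an exclusive lock, F_RDLCK for a
     * shared one, or F_UNLCK to release a lock we hold.
     */
    int
    lock_file(int fd, int ltype)
    {
            struct flock  fl;

            fl.l_type = ltype;
            fl.l_whence = SEEK_SET;
            fl.l_start = 0;
            fl.l_len = 0;       /* zero length means "to end of file" */
            return(fcntl(fd, F_SETLKW, &fl));
    }
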
In this article, we will refer to processes rather than
machines, because the methods presented apply equally to
multiple processors on a single machine and to multiple
machines distributed over a network.

The method we adopted for sharing our indirect irradiance
values is simple. Each process caches a small number of
values (on the order of 16 -- enough to fill a standard
UNIX buffer) before appending them to a file. In
preparation for writing out its buffer, the process places
an exclusive lock on the file, then checks to see whether
the file has grown since it last looked. If it has, the
process reads in the new information, assuming it has come
from another process that is legitimately working on this
file. Finally, the process flushes its buffer and releases
the lock on the file. The file thus contains the
cumulative indirect irradiance calculations of all the
processes, and every process also has this information in
memory (current as of the last time it flushed its
buffer). Saving the information to a file has the further
advantage of providing a convenient way to reuse the data
for later renderings.

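In outline, the flushing step just described might look
something like this in C. It is only a sketch, not code
taken from Radiance: the record layout and the names
AMBVAL, ambfd, lastpos, mybuf, myblen, use_value, and
flush_values are invented for illustration, and lock_file
is the helper sketched earlier.

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define NCACHE  16                  /* values cached per flush */

    typedef struct {                    /* hypothetical record layout */
            double  pos[3];             /* where the value was computed */
            double  val[3];             /* the indirect irradiance */
    } AMBVAL;

    extern int   lock_file(int fd, int ltype);  /* sketched above */
    extern void  use_value(AMBVAL *av); /* add value to the in-core set */

    static int     ambfd;               /* shared value file, open read/write */
    static off_t   lastpos;             /* how much of the file we have seen */
    static AMBVAL  mybuf[NCACHE];       /* values computed since last flush */
    static int     myblen;              /* number of values now in mybuf */

    void
    flush_values(void)
    {
            AMBVAL  av;

            lock_file(ambfd, F_WRLCK);          /* get exclusive access */
            lseek(ambfd, lastpos, SEEK_SET);    /* has the file grown? */
            while (read(ambfd, &av, sizeof(av)) == sizeof(av))
                    use_value(&av);             /* absorb others' new values */
            write(ambfd, mybuf, myblen*sizeof(AMBVAL));  /* append our buffer */
            lastpos = lseek(ambfd, (off_t)0, SEEK_CUR);  /* note new end of file */
            myblen = 0;
            lock_file(ambfd, F_UNLCK);          /* release the lock */
    }

Each flush costs one lock, a read of whatever other
processes have appended, and one small write, so the
access frequency stays low, as the earlier discussion of
NFS contention requires.
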
The image to be rendered is divided into many small
pieces, more pieces than there are processors. This way,
if one piece takes longer than the others, the processors
that had easy pieces are not all waiting for the processor
with the difficult piece to finish. Coordination between
processes is again handled by the network lock manager. A
file contains the position of the last piece being worked
on, and as soon as a processor finishes its piece, it
locks the file, finds out what to work on next, increments
the position, and unlocks the file again. Thus, there is
no need for a single controlling process, and rendering
processes may be initiated and terminated at will.

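The assignment step can be sketched in the same way. Here
the shared file holds nothing but the index of the next
piece to hand out; the name next_piece and the descriptor
syncfd are again invented for illustration and are not the
actual Radiance interfaces.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    extern int  lock_file(int fd, int ltype);   /* sketched earlier */

    /* Claim the next unrendered piece, or return -1 if none remain. */
    int
    next_piece(int syncfd, int npieces)
    {
            char  buf[32];
            int   n, piece;

            lock_file(syncfd, F_WRLCK);
            lseek(syncfd, 0L, SEEK_SET);
            n = read(syncfd, buf, sizeof(buf)-1);
            buf[n > 0 ? n : 0] = '\0';
            piece = atoi(buf);              /* position of the next piece */
            sprintf(buf, "%d\n", piece+1);  /* increment for the next taker */
            lseek(syncfd, 0L, SEEK_SET);
            write(syncfd, buf, strlen(buf));
            lock_file(syncfd, F_UNLCK);
            return(piece < npieces ? piece : -1);
    }
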
ICSD's offer of their farm of SPARC-10's was an ideal
opportunity to test our programs under real conditions.
The problem at hand was producing numerically accurate,
high-resolution renderings of the lower deck of a ship
under different lighting conditions. Three images were
rendered one at a time, with all 10 SPARC-10 machines
working on each image simultaneously. The wall time
required to render one image was about 4.3 hours. The
first machine finished all it could do shortly after the
last image piece was assigned, at 2.8 hours. Thus, many of
the processors in our test run were done before the entire
image was complete. This is a consequence of not breaking
the image into small enough pieces for efficient processor
allocation.

For the time that the processors were running, all but one
had 98% or 99% CPU utilization. The one exception was the
file server, which had 94% CPU utilization. This means
that the processors were well saturated while working on
our job, not waiting for image piece assignments, disk
access, etc.

If we include the time at the end when some processors had
finished while others were still going, the effective CPU
utilization averaged 84%, with the lowest at 75%. Again,
this low figure was due to the fact that the picture
should have been divided into more than the 49 pieces we
specified. (The overall utilization was really better than
this, since we set the jobs up to run one after the other,
and once a processor finished its part of one image it
went on to work on the next image.)

The real proof of a parallel implementation is not CPU
utilization, however, but the speedup factor. To examine
this, it was necessary to start the job over, running on a
single processor. Running alone, one SPARC-10 took about
35 hours to finish an image, with 99% CPU utilization.
That is about 8.2 times as long as the total time required
by 10 processors to finish the image, with the shortfall
from a full factor of 10 due mostly to processors sitting
idle at the end. This ratio, 8.2/10, is very close to the
average effective CPU utilization value of 84%, indicating
that parallel processing does not result in a lot of
redundant calculation.

Our experience showed that an incompletely parallelizable
problem could be solved efficiently on distributed
processors using NFS as a data sharing mechanism. The
principal lesson we learned from this exercise is that
good utilization of multiple processors requires that the
job be broken into small enough chunks. It is perhaps
significant that the time spent idle, 16%, corresponds
roughly to the fraction of its total running time that a
processor needed to finish one piece (since there were
about 5 chunks for each processor). If we were to decrease
the size of the pieces so that each processor got 20
pieces on average, we should expect the idle time to go
down to around 5%.