           Parallel Rendering on the ICSD SPARC-10's

                           Greg Ward
               Energy and Environment Division

The Information and Computing Services Division was kind enough to
make 10 Sun SPARC-10's available on the network for enterprising
individuals who wished to perform experiments in distributed parallel
processing.  This article describes the method we developed to
efficiently run an incompletely parallelizable rendering program in a
distributed processing environment.

The lighting simulation and rendering software we have developed over
the past 8 years, Radiance, has only recently been made to work in
parallel environments.  Although parallel ray tracing programs have
been kicking around the graphics community for several years,
Radiance uses a modified ray tracing algorithm that does not adapt as
readily to a parallel implementation.  The main difference is that
Radiance produces illumination information that is globally reused
during the rendering of an image.  Thus, spawning disjoint processes
to work on disjoint parts of an image will not result in the linear
speedup desired.  Each independent process would create its own set
of "indirect irradiance" values for its section of the image, and
many of these values would be redundant and would represent wasted
CPU time.  It is therefore essential that this information be shared
among different processes working on the same scene.  The question
is, how to do it?

To minimize incompatibilities with different UNIX implementations, we
decided early on in our parallel rendering work to rely on the
Network File System (NFS) only, imperfect as it is.  The chief
feature that enables us to do parallel rendering is NFS file locking,
which is supported by most current UNIX implementations.  File
locking allows a process on the same machine or a different machine
to restrict access to any section of an open file that resides either
locally or on an NFS-mounted filesystem.  Thus, data sharing is
handled through the contents of an ordinary file and coordinated by
the network lock manager.  This method can be slow under high
contention, so access frequency must be kept low.

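To make the mechanism concrete, here is a minimal sketch of the kind
of advisory record locking the network lock manager provides through
the standard fcntl() interface.  The routine names (lock_file and
unlock_file) are ours for illustration only, not actual Radiance
code, and error checking is omitted.

#include <fcntl.h>
#include <unistd.h>

static int
lock_file(int fd, short type)           /* type: F_WRLCK or F_RDLCK  */
{
        struct flock    fl;

        fl.l_type = type;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;                 /* lock the whole file       */
        fl.l_len = 0;                   /* (length 0 means to EOF)   */
        return(fcntl(fd, F_SETLKW, &fl));       /* wait until granted */
}

static int
unlock_file(int fd)
{
        struct flock    fl;

        fl.l_type = F_UNLCK;            /* release our lock          */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        return(fcntl(fd, F_SETLK, &fl));
}
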
In this article, we will refer to processes rather than machines,
because the methods presented work both for multiple processors on a
single machine and for multiple machines distributed over a network.

The method we adopted for sharing our indirect irradiance values is
simple.  Each process caches a small number of values (on the order
of 16 -- enough to fill a standard UNIX buffer) before appending them
to a file.  In preparation for writing out its buffer, the process
places an exclusive lock on the file, then checks to see whether the
file has grown since the last time it looked.  If it has, the process
reads in the new information, assuming it has come from another
process legitimately working on this file.  Finally, the process
flushes its buffer and releases the lock on the file.  The file thus
contains the cumulative indirect irradiance calculations of all the
processes, and each process also holds this information in memory,
current as of the last time it flushed its buffer.  Saving the
information to a file has the further advantage of providing a
convenient way to reuse the data for later renderings.

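The sketch below outlines that flush cycle, assuming a flat file of
fixed-size records and using the lock_file/unlock_file routines
sketched earlier.  The record layout and names are placeholders for
illustration, not the actual Radiance implementation.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define NCACHED         16      /* values buffered before each flush */

struct irradrec { double data[8]; };    /* stand-in irradiance record */

extern int      lock_file(int, short), unlock_file(int);

static struct irradrec  buf[NCACHED];   /* our unwritten values      */
static int      nbuffered = 0;          /* number of values in buf[] */
static off_t    lastpos = 0;            /* file length at last sync  */

void
flush_shared(int fd)                    /* fd: shared irradiance file */
{
        struct irradrec rec;
        off_t           flen;

        lock_file(fd, F_WRLCK);         /* exclusive lock (see above) */
                                        /* did the file grow since    */
        flen = lseek(fd, (off_t)0, SEEK_END);   /* our last sync?     */
        if (flen > lastpos) {
                lseek(fd, lastpos, SEEK_SET);
                while (read(fd, &rec, sizeof(rec)) == sizeof(rec)) {
                        /* merge rec into our in-core value set here */
                }
        }
                                        /* append our buffered values */
        lastpos = lseek(fd, (off_t)0, SEEK_END);
        write(fd, buf, nbuffered*sizeof(struct irradrec));
        lastpos += nbuffered*sizeof(struct irradrec);
        nbuffered = 0;

        unlock_file(fd);                /* let others at the file     */
}
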
The image to be rendered is divided into many small pieces, more
pieces than there are processors.  This way, if one piece takes
longer than the others, the processors that had easy pieces are not
all waiting for the processor with the difficult piece to finish.
Coordination between processes is again handled by the network lock
manager.  A file contains the position of the last piece being worked
on, and as soon as a processor finishes its piece, it locks the file,
finds out what to work on next, increments the position, and unlocks
the file again.  Thus, there is no need for a single controlling
process, and rendering processes may be initiated and terminated at
will.

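Again purely as an illustration, one way such a shared progress file
might be managed is sketched below, assuming the count is stored as a
single ASCII integer; the routine and its arguments are hypothetical
rather than taken from Radiance.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

extern int      lock_file(int, short), unlock_file(int);

int
next_piece(int fd, int npieces)         /* returns -1 when all assigned */
{
        char    line[32];
        int     n, piece;

        lock_file(fd, F_WRLCK);                 /* gain exclusive access */

        lseek(fd, (off_t)0, SEEK_SET);          /* read current position */
        n = read(fd, line, sizeof(line)-1);
        line[n > 0 ? n : 0] = '\0';
        piece = atoi(line);

        if (piece < npieces) {                  /* claim it and advance  */
                n = sprintf(line, "%d\n", piece+1);
                lseek(fd, (off_t)0, SEEK_SET);
                write(fd, line, n);
        } else
                piece = -1;                     /* nothing left to do    */

        unlock_file(fd);                        /* release the file      */
        return(piece);
}
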
ICSD's offer to use their farm of SPARC-10's was an ideal opportunity
to test our programs under real conditions.  The problem at hand was
producing numerically accurate, high-resolution renderings of the
lower deck of a ship under different lighting conditions.  Three
images were rendered one at a time, with all 10 SPARC-10 machines
working on each image simultaneously.  The wall time required to
render one image was about 4.3 hours.  The first machine finished all
it could do shortly after the last image piece was assigned, at 2.8
hours.  Thus, many of the processors in our test run were done before
the entire image was complete.  This happened because we did not
break the image into small enough pieces for efficient processor
allocation.

For the time that the processors were running, all but one had 98% or
99% CPU utilization.  The one exception was the file server, which
had 94% CPU utilization.  This means that the processors were well
saturated while working on our job, not waiting for image piece
assignments, disk access, etc.

If we include the time at the end when some processors had finished
while others were still going, the effective CPU utilization averaged
84%, with the lowest at 75%.  Again, this low figure was due to the
fact that the picture should have been divided into more than the 49
pieces we specified.  (The overall utilization was really better than
this, since we set the jobs up to run one after the other, and once a
processor finished its part of one image it went on to work on the
next image.)

The real proof of a parallel implementation is not CPU utilization,
however, but the speedup factor.  To examine this, it was necessary
to start the job over, running on a single processor.  Running alone,
one SPARC-10 took about 35 hours to finish an image, with 99% CPU
utilization.  That is about 8.2 times as long as the total time
required by 10 processors to finish the image; the shortfall from a
factor of 10 is due mostly to processors sitting idle at the end.
This ratio, 8.2/10, is very close to the average effective CPU
utilization of 84%, indicating that parallel processing does not
result in a lot of redundant calculation.

Our experience showed that an incompletely parallelizable problem
could be solved efficiently on distributed processors using NFS as a
data-sharing mechanism.  The principal lesson we learned from this
exercise is that good utilization of multiple processors requires
that the job be broken into small enough chunks.  It is perhaps
significant that the time spent idle, 16%, corresponds roughly to the
percentage of the total time required by a processor to finish one
piece (since there were about 5 chunks for each processor).  If we
were to decrease the size of the pieces so that each processor got 20
pieces on average, we should expect the idle time to go down to
around 5%.
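
As a rough back-of-the-envelope check, assuming the idle tail
averages about the length of one piece per processor:

        idle fraction  ~  1 / (pieces per processor)

        49 pieces / 10 processors  ~  5 pieces each  ->  1/5  = 20%
                                                         (16% observed)
        20 pieces per processor                      ->  1/20 =  5%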