[Radiance-general] sharing indirect values for parallel processing?

Fri Feb 4 17:12:53 CET 2005

Hi all interested,

Cross posting to dev as this is probably a more appropriate space for 
this conversation.

OK, let's see who I can irritate the most...

As a refresher, there have been numerous threads on this topic on 
Radiance Dev (in no order other than my searching through my mail):

    * Before we give up on lock files...
    * multiprocessor systems, Radiance and you
    * as well as others if you want to delve into the depths of the
      pre-radiance-online mailing list archives

In general, I recall that there are a couple of directions to go:

    * network filesystem locking - such as NFS or Samba, where we are
      dependent on either the locking mechanism actually working (eg
      NFS) or the filesystem (Samba) being installed
    * client/server - probably more hairy from an implementation
      standpoint as well as from a porting point of view. Although,
      perhaps guaranteeing the best performance for selected OSes?

Not to rehash old stuff, but could one of the more knowledgeable 
developers (Greg, Georg, Peter, Carsten...?) give us a refresher on what 
the options are and perhaps some idea of the time that would be needed 
to implement a working solution? Locking is a recurring problem. It 
would be nice to figure a consensus solution (ie what direction to 
pursue) and then a strategy for implementation (ie resources, person(s), 
money...), so perhaps we as a community could figure out how to move 
this forward (if as always there is enough interest).

I must admit to having run into this wall on a variety of occasions. NFS 
(v3) on Linux is "supposed" to lock correctly (sync mode on the 
mount/fstab). There is a test suite from Sun (www.connectathon.org) 
that is supposed to test the NFS server. I remember running this test 
suite in the past and getting positive results on Linux. Nevertheless, 
I have found it extremely difficult to get working results with a 
networked image render (eg rpiece distributed over multiple cpu 
nodes). There end up being problems either with ambient values between 
image cells, with locking of the syncfile for distributing image cells 
to different machines, or both. I even implemented a client/server in 
Perl at one point to try to fight this problem with the syncfile (with 
partial success as I recall, and perhaps more if my time would allow). 
Not to cause offense... But is it possible that the locking code in 
Radiance needs to be checked itself?
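
For anyone who wants to check locking on a particular mount, here is a 
rough Python sketch of a locked read-modify-write on a shared counter 
file. It is not Radiance's code and not the connectathon suite; the 
names locked_increment and run_lock_check are made up for illustration. 
It uses POSIX advisory locks via fcntl.lockf and forks local workers, so 
on a single host it only exercises that client. To test a server, 
locked_increment would have to be started from several machines against 
the same path.

```python
import fcntl
import os


def locked_increment(path, times):
    """Increment the integer stored in `path`, holding an exclusive lock each time."""
    for _ in range(times):
        with open(path, "r+") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)   # blocks until the advisory lock is granted
            try:
                value = int(f.read() or 0)
                f.seek(0)
                f.truncate()
                f.write(str(value + 1))
                f.flush()
                os.fsync(f.fileno())        # push the update out before unlocking
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)


def run_lock_check(path, workers=4, times=50):
    """Return True if no increment was lost, i.e. the lock serialized all writers."""
    with open(path, "w") as f:
        f.write("0")
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:                        # child process: do the work, then exit
            try:
                locked_increment(path, times)
            finally:
                os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    with open(path) as f:
        return int(f.read()) == workers * times
```

If run_lock_check returns False on a shared mount, updates were lost and 
the lock did not hold there. A True result only says that no loss was 
observed in that run.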

In brief follow-up to Lars's comments about openmosix/mosix: as I 
understand it, the mfs filesystem is supposed to implement locking 
correctly. There are also other more sophisticated network filesystems 
such as GFS (Sistina, I think, and commercial), OpenGFS and many others. 
However, these all require a separate special install and perhaps 
modification of the kernel or installation of a modified kernel, and 
there is serious question as to whether these are portable to other 
OSes such as MS version whatever (as the main offender of portability).

Note also that I tried openmosix at one point. One problem that I found 
is that if you start multiple large (eg memory size) jobs on the master 
node then this can lead to excessive paging, since the master node 
tries to start the jobs at the same time in its own memory space prior 
to migrating them off to other nodes in the cluster. So if your job 
requires 1 Gig of memory to hold the scene and you want to run 10 jobs 
on 5 dual processor nodes with each node having 2 Gig of memory, if you 
start all the jobs on one node then you are hosed. If you start them on 
individual nodes, then you should be using a different clustering 
solution since this completely negates the value of the migration 
algorithms in openmosix. Now it has been a while since I used OpenMosix, 
so perhaps things are different...
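
To put numbers on that example, here is a small Python sketch of the 
arithmetic. The figures are the ones from the paragraph above; the 
function name memory_demand_gb is made up for illustration, and OS and 
other overhead is ignored.

```python
def memory_demand_gb(jobs_on_node, scene_gb):
    """Memory needed if every job loads its own copy of the scene on one node."""
    return jobs_on_node * scene_gb


scene_gb, node_ram_gb, total_jobs, nodes = 1, 2, 10, 5

# All jobs launched on the master node before any migration happens:
all_on_master = memory_demand_gb(total_jobs, scene_gb)       # 10 GB against 2 GB of RAM
# Jobs launched directly on the node that will run them (2 per node):
per_node = memory_demand_gb(total_jobs // nodes, scene_gb)   # 2 GB against 2 GB of RAM

print(all_on_master / node_ram_gb)   # oversubscription factor on the master: 5.0
print(per_node <= node_ram_gb)       # True, ignoring OS and other overhead
```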

Note also that named pipes do not work on OpenMosix (at least they did 
not back in mid 2003; you can see my brief inquiry to the openmosix 
list and Moshe Bar's even briefer reply back in April of 2003). So if 
you want to do memory sharing on multiprocessor nodes you have to roll 
your own batch job distributor.

-Jack de Valpine

Georg Mischler wrote:

>Lars O. Grobe wrote:
>
>>>The most straightforward solution to our problem would probably
>>>be to use lock files, as Greg suggested in earlier discussions.
>>>Unfortunately nobody has found the time yet to actually implement
>>>that. If anyone wants to volunteer, please move the discussion of
>>>your proposal to the dev-list.
>>>
>>Hi,
>>
>>as I won't be able to help on the implementation, I won't bring this to
>>the dev-list for now ;-) However, I guess the only needed feature of
>>the shared fs used is a working byte range locking, right? So I will
>>find out if the fs provided by openmosix (mfs) has this feature, which
>>would make a set of mosix nodes a great radiance installation.
>
>Ambient files are only written at the end, so file locking
>and byte range locking have the same effect.
>We also need a solution that works on all platforms and on
>all file systems. Requiring third party software just to get
>reliable file sharing is clearly out of the question.
>
>
>-schorsch
>
>  
>
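
For reference, the lock-file idea from the quoted message could look 
roughly like the Python sketch below. This is an assumption about one 
possible shape, not anything that exists in Radiance: the names 
lock_file and append_records and the ".lock" suffix are made up. It 
relies on open() with O_CREAT | O_EXCL failing when the file already 
exists. Whether that creation is atomic on a given network filesystem is 
exactly the open question in this thread, and a crashed process would 
leave a stale lock file behind that this sketch does not clean up.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def lock_file(target, timeout=30.0, poll=0.05):
    """Hold `target + ".lock"` while the body runs; raise TimeoutError if it stays taken."""
    lock_path = target + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Exclusive create: fails with FileExistsError if someone holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())   # record the owner for debugging
        os.close(fd)
        yield lock_path
    finally:
        os.unlink(lock_path)                      # release the lock


def append_records(target, data):
    """Append `data` (bytes) to `target` while holding its lock file."""
    with lock_file(target):
        with open(target, "ab") as f:
            f.write(data)
```

Each writer would call append_records with its finished block of 
values, so only one process at a time appends to the shared file.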