discuss: Thread: Beowulf howto


Subject: Beowulf howto
From: Kurt ####@####.####
Date: 12 May 2004 22:06:07 -0000
Message-Id: <3.0.6.32.20040512170604.009bf230@pop3.spacestar.net>

Earlier this year I submitted a proposed beowulf howto, and you expressed
concerns about security. 

I have revised it, so here it is:

Wolf Up by Kurt Swendson

0. Introduction
This document gives step-by-step instructions for building a Beowulf
cluster.   After reviewing the documentation that was available, I felt
there were enough gaps and omissions that a document accurately describing
how to build a Beowulf cluster would be beneficial.
	
I first saw Thomas Sterling's article in Scientific American, and
immediately got the book, because its title was "How to Build a Beowulf".
It was no doubt a valuable reference, but it does not really walk you
through exactly what to do.

So what follows is a description of what I got to work.  It is only one
example - my example.  You may choose a different message passing
interface; you may choose a different Linux distribution.   You may also
spend as much time as I did researching and experimenting, and learn on
your own.

1.  Definitions and requirements
1.1 Definitions
What's the difference between a true Beowulf cluster and a COW [cluster of
workstations]?  

Brahma gives a good definition; see
http://www.phy.duke.edu/brahma/beowulf_book/node62.html  through
node6n.html.

What we will be doing here in the beginning is actually a cow, but after
you get used to the details, you will be able to do the automated install,
which will change your cow into a wolf.

If you are a "user" at your organization, and you have the use of some
nodes, you may still do the instructions shown here to create a cow.  But
if you "own" the nodes, that is, if you have complete control of them, and
are able to completely erase and rebuild them, you may create a true
Beowulf cluster.  

If you go to the above web page, you will see that the author suggests that
you manually configure each box, and then, later on, after you get the feel
of this whole "wolfing up" procedure, you can set up new nodes
automatically, which I will describe in a later document.

1.2 Requirements
Let's briefly outline your requirements: 

1. More than one box, each equipped with a network card.
2. A switch or hub to connect them.
3. Linux 
4. A message passing interface. [I used lam]

A kvm switch is not a requirement, but even a simple two-port kvm switch is
convenient while setting up and/or debugging.

2.  Set up the head node
So let's get wolfing.   Choose the most powerful box to be the head node.
Install Linux on it, and choose every package you want.  The only
requirement is that you [in RH speak] choose "Network Servers" because you
need to have NFS and ssh.   That's all you need.  But in my case, I was
going to do development of the Beowulf application, so I added X and C
development.

Those of you researching Beowulf systems will also know that you can put a
second network card in the head node so you can access it from the outside
world.   This is not required for the operation of a cluster, so you may do
what you want regarding this.

I learned the hard way: use a password that obeys the strong password
constraints for your Linux distribution.  I used an easily-typed password
like "a" for my user, and the whole thing did not work.  When I changed my
password to a legal password, with mixed numbers, characters, upper and
lower case, it worked.  

If you use lam as your message passing interface, you will read in the
manual to turn OFF the firewalls, because LAM uses random port numbers to
communicate between nodes.  Here is a rule:  If the manual tells you to do
something, DO IT!   The lam manual also tells you to run as a non-root
user.  Make the same user for every box: build every box in the cluster
with the same user "wolf" and the same password.

2.1  Hosts
First we will modify /etc/hosts.  You will see the comments telling you to
leave the "localhost" line alone.   I blatantly ignored that advice and
changed it so that it no longer lists my hostname on the loopback address.

The line used to say:  127.0.0.1  wolf00 localhost.localdomain localhost

It now says: 127.0.0.1 localhost.localdomain localhost

Then I added all the boxes on my network.  Note: This is not required for
the operation of a Beowulf cluster; only convenient for me, so that I may
type a simple "wolf01" instead of 192.168.0.101:

192.168.0.100	wolf00
192.168.0.101	wolf01
192.168.0.102	wolf02
192.168.0.103	wolf03
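
A quick check I like to do at this point [my own habit, not a required
step]: make sure the names resolve the way you expect.

# getent just reads /etc/hosts, so it works even before the other boxes exist
getent hosts wolf01
# ping obviously requires the target box to be up and cabled
ping -c 1 wolf01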

2.2  Groups
In order to responsibly set your cluster up, especially if you are a "user"
of your boxes [see Definitions and Requirements], you should have some
measure of security.   After you create your user, create a group, and add
the user to the group.  Then, you may modify your files and directories to
only be accessible by the users within that group:

groupadd beowulf
usermod -g beowulf wolf

… and add the following to .bash_profile:

umask 007

Now any files created by the user "wolf" [or any user within the group]
will automatically be readable and writable only by members of the group
"beowulf".
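
If you want to convince yourself the umask is doing what you expect [again,
a check of my own, not part of the original steps], log in as wolf and
create a file:

touch ~/testfile
ls -l ~/testfile

With umask 007 in effect [log out and back in, or run "umask 007" by hand
first], the file should show up as -rw-rw---- with owner wolf and group
beowulf.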

2.3  NFS
Refer to the following web site:
http://www.ibiblio.org/mdw/HOWTO/NFS-HOWTO/server.html#CONFIG
Print it out, and keep it at your side.  I will be walking you through how
to modify your system in order to create an NFS server, but I have found
this site invaluable, as you may also.

Make a directory for everybody to share:

mkdir /mnt/wolf
chmod 770 /mnt/wolf
chown -R wolf:beowulf /mnt/wolf

Go to the /etc directory, and add your "shared" directory to the exports file:

cd /etc
cat exports
cat >> exports
/mnt/wolf	192.168.0.0/255.255.255.0(rw)
<control d>
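
Two notes of my own here.  First, the options must butt right up against
the client specification [no space before "(rw)"], or they end up applying
to the whole world instead of your subnet.  Second, the exports file is
only read when NFS starts or when you tell it to reread the file, so if NFS
is already running you can make the new export visible without a reboot:

exportfs -ra
# list what the server is actually exporting
showmount -e localhost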

2.4  IP addresses
My network is 192.168.0.nnn because it is one of the "private" IP ranges.
Thomas Sterling talks about it on page 106 of his book.  It is inside my
firewall, and works just fine.  My head node, which I call "wolf00" is
192.168.0.100, and every other node is named "wolfnn", with an ip of
192.168.0.100 + nn.  I am following the sage advice of many of the web
pages out there, and setting myself up for an easier task of scaling up my
cluster. 

2.5  Services
Make sure the services we want are turned on:

chkconfig --add sshd
chkconfig --add telnet
chkconfig --add nfs
chkconfig --add rexec
chkconfig --add rlogin
chkconfig --level 3 sshd on
chkconfig --level 3 telnet on
chkconfig --level 3 nfs on
chkconfig --level 3 rexec on
chkconfig --level 3 rlogin on

Telnet?  I added it just as a convenience.  It is not needed, but it is
nice to have while debugging your NFS setup: how else are you going to log
into a box if ssh is not working yet?  That is also the only reason I used
the kvm switch; it is useful for going back and forth between the head node
and the node I am currently setting up.

…And, during startup, I saw some services that I know I don't want and
that, in my opinion, can be removed:

chkconfig --del atd
chkconfig --del rsh
chkconfig --del sendmail
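
To double-check what you ended up with [my own habit], ask chkconfig to
list a service and confirm it is on for runlevel 3:

chkconfig --list sshd
chkconfig --list nfs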

2.6 SSH
To be responsible, we make ssh work.  While logged in as root, you must
modify the /etc/ssh/sshd_config file.   The lines:

#RSAAuthentication yes
#AuthorizedKeysFile .ssh/authorized_keys

… are commented out, so uncomment them [remove the #].   

Reboot, and log back in as wolf, because the operation of your cluster will
always be done as this user; rebooting also lets the hosts file
modifications done earlier take effect.  To generate your public and
private SSH keys, do this:

ssh-keygen -b 1024 -f ~/.ssh/id_rsa -t rsa -N ""

… and it will display a few messages, and tell you that it created the
public / private key pair.  You will see these files, id_rsa and
id_rsa.pub, in the /home/wolf/.ssh directory.  

Copy the id_rsa.pub file into a file called "authorized_keys" right there
in the .ssh directory.  We will be using this file later. 
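
One way to do that copy [any equivalent command is fine]:

cd ~/.ssh
cp id_rsa.pub authorized_keys

If the file already exists and holds keys you want to keep, use
"cat id_rsa.pub >> authorized_keys" instead, which appends rather than
overwrites.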

Modify the security on the files, and the whole directory: 

chmod 644 ~/.ssh/auth*
chmod 755 ~/.ssh

According to the LAM user group, only the head node needs to log on to the
slave nodes; not the other way around.  Therefore when we copy the public
key files, we only copy the head node's key file to each slave node, and
set up the agent on the head node.   This is MUCH easier than copying all
nodes to all nodes.  I will describe this in more detail later.  

Note:  I only am documenting what the LAM distribution of the message
passing interface requires; if you chose another MPI to build your cluster
your requirements may differ.  

At the end of /home/wolf/.bash_profile, add the following statements [again
this is lam-specific; your requirements may vary]:

export LAMRSH='ssh -x'
sh -c 'ssh-add && bash'

2.7 MPI
Lastly, put your message passing interface on the box.  You can either
build it from the supplied source, or use their precompiled package.   It
is beyond the scope of this document to describe that - I just got the
source and followed the directions, and in another experiment I installed
their rpm; both worked fine.  Remember, the whole reason we are doing this
is to learn - go forth and learn.
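
For reference, a source build of lam is the usual configure / make dance,
roughly like this [the prefix is just an example of mine; check the README
that comes with the version you download]:

./configure --prefix=/usr/local/lam
make
make install

Whatever prefix you choose, make sure its bin directory ends up on wolf's
PATH on every node.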


3.  Set up slave nodes
Get your network cables out.    Install Linux on the first non-head node.
Going with my example node names and IP addresses, this is what I chose
during setup:

Workstation
auto partition
remove all partitions on system
use LILO as the boot loader
put boot loader on the MBR
host name wolf01
ip address 192.168.0.101
add the user "wolf" with the same password as on all other nodes
NO firewall
ONLY package group installed:  Network Servers.  Unselect all other packages.

It doesn't matter what else you choose; this is the minimum of what you
need.   Why fill the box up with non-essential software you will never use?
My research has concentrated on finding the minimal configuration needed to
get up and running.

Here's another very important point.  When you move on to an automated
install and config, you really will NEVER log in to the box.  Only during
setup and install do I type anything directly on the box.   

When the computer starts up, it will complain if it does not have a
keyboard connected.  I was not able to modify the BIOS, because I had older
discarded boxes with no documentation, so I just connected a "fake"
keyboard.  I am in the computer industry, and see hundreds of keyboards
come and go; some occasionally end up in the garbage.  I take an old dead
keyboard out of the garbage and remove JUST the cord, with the tiny circuit
board up in the corner where the num lock and caps lock lights are.  Then I
plug the cord in, and the computer thinks it has a complete keyboard and
boots without incident.  Again, you would be better off changing the BIOS
setting if you are able to; this is just a trick for the case where you
don't have access to the BIOS setup program.
 
After your newly installed box reboots, log on as root again, and…

1.	Do the same chkconfig commands stated above to set up the right services.
2.	Modify /etc/hosts: remove "wolfnn" from the localhost line, and add
entries for wolfnn and wolf00.
3.	Install lam.
4.	Create the /mnt/wolf directory and set up security for it.
5.	Do the ssh configuration.

Up to this point, the slave nodes are set up pretty much the same as the
head node.   I do NOT modify the exports file on them.  Also, this line is
NOT added to their .bash_profile:

sh -c 'ssh-add && bash'

Recall that on the head node, we created a file "authorized_keys".  Copy
that file, created on your head node, to the ~/.ssh directory on each slave
node.  The HEAD node will log on to all the SLAVE nodes.  The requirement,
as stated in the LAM user manual, is that there should be no interaction
required when logging in from the head to any of the slaves.  So, copying
the public key from the head node into each slave node's "authorized_keys"
file tells each slave that "wolf user on wolf00 is allowed to log on here
without any password or anything; we know it is safe."
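
One way to get the file over there [you will be prompted for wolf's
password this once, and possibly for the host-key confirmation described
below]:

scp ~/.ssh/authorized_keys wolfnn:.ssh/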

However, you may recall that the documentation states that the first time
you log on, it will give you some dialog and ask for confirmation.   So,
just once after doing the above configuration, go back to the head node,
and type

ssh wolfnn

where "wolfnn" is the name of your newly configured slave node. It will ask
you for confirmation, and you simply answer "yes" to it, and that will be
the last time you will have to interact.  Prove it by logging back off, and
then ssh back to that node, and it should just immediately log you in, with
no dialog whatsoever.  

More configuration for the slave nodes:

cat >> /etc/fstab
wolf00:/mnt/wolf    /mnt/wolf     nfs     rw,hard,intr    0 0
<control d>
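
You can test the new fstab entry right away, without rebooting, provided
the head node's NFS export is already up [a suggestion of mine, not in the
original steps]:

# mount everything listed in fstab that is not already mounted
mount -a
df -h /mnt/wolf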

Then I modify /etc/lilo.conf.  The second line of this file says timeout=nn.
Modify that line to say "timeout=1200".   After it is modified, as root, run
/sbin/lilo, and it will make the change take effect.   It will say
"Added linux *".

Why do I do this lilo modification?  If you have been researching Beowulf
on the web, and understand everything I have done so far, you might be
wondering, "I don't remember reading anything about lilo.conf."

My Beowulf cluster all sits on a single power strip.   I turn on the power
strip, and every box in the cluster starts up immediately.  As the startup
procedure progresses, each box mounts its file systems.   Since the
non-head nodes mount the shared directory from the head node, they have to
wait a little bit until the head node is up, with NFS ready to go.   So, I
make each non-head node wait 2 minutes at the lilo prompt [the timeout is
in tenths of a second, so 1200 means 2 minutes].  Meanwhile, the head node
is coming up and making the shared directory available; by the time lilo's
2 minutes are up, the non-head nodes boot and mount it cleanly.

4.  Verification
All done!   You are almost ready to start wolfing. Reboot your boxes.   Did
they all come up?  Can you ping the head node from each box? Can you ping
each node from the head node?   Can you telnet?  Can you ssh?   Don't worry
about doing ssh as root; only as wolf.   If you are logged in as wolf, and
ssh to a box, does it go automatically, without prompting for password?  
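
If you have more than a couple of nodes, a little loop saves typing.
Something along these lines [adjust the node names to yours] answers most
of the questions above in one shot:

for n in wolf01 wolf02 wolf03; do
    ping -c 1 $n > /dev/null && echo "$n: ping ok"
    ssh $n hostname
done

If the keys are set up correctly, each ssh prints the remote hostname
without asking for a password.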

After the node boots up, log in as wolf, and say "mount".   Does it show
wolf00:/mnt/wolf mounted?  On the head node, copy a file into /mnt/wolf.
Can you read and write that file from the node box?   

This is really not required; it is merely convenient to have a common
directory residing on the head node.  With a common shared directory, you
can easily use scp to copy files between boxes.  Sterling states in his
book, on page 119, that a single NFS server becomes a serious obstacle to
scaling up to large numbers of nodes.   I learned this when I went from a
small number of boxes up to a large number.

5.  Run a program
Once you can do all the tests shown above, you should be able to run a
program.  From here on in, the instructions are lam specific. Go back to
the head node, log in as wolf, and:

cat > /mnt/wolf/lamhosts
wolf00
wolf01
wolf02
wolf03
wolf04
<control d>

Go to the lam examples directory, and compile "hello.c":

mpicc -o hello hello.c
cp hello /mnt/wolf

Then, as shown in the lam documentation, start up lam: 

[wolf@wolf00 wolf]$ lamboot -v lamhosts 
LAM 7.0/MPI 2 C++/ROMIO - Indiana University
n0<2572> ssi:boot:base:linear: booting n0 (wolf00) 
n0<2572> ssi:boot:base:linear: booting n1 (wolf01) 
n0<2572> ssi:boot:base:linear: booting n2 (wolf02) 
n0<2572> ssi:boot:base:linear: booting n3 (wolf04) 
n0<2572> ssi:boot:base:linear: finished 

So we are now finally ready to run an app.   [Remember, I am using lam;
your message passing interface may have different syntax].   

[wolf@wolf00 wolf]$ mpirun n0-3 /mnt/wolf/hello 
Hello, world! I am 0 of 4 
Hello, world! I am 3 of 4 
Hello, world! I am 2 of 4 
Hello, world! I am 1 of 4 

[wolf@wolf00 wolf]$
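
When you are finished, lam can be shut down cleanly [again, this is
lam-specific; lamclean is handy if a run dies and leaves stray processes
behind]:

lamclean
lamhalt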

Recall I mentioned the use of NFS above.  I am telling the nodes to all use
the nfs shared directory, which will become a bottleneck with a larger
number of boxes.  You could very easily copy the executable to each box
and, in the mpirun command, specify node-local directories:  mpirun n0-3
/home/wolf/hello.  The prerequisite for this is to have all the files
available locally.  In fact I have done this, and it worked better than
using the nfs shared executable.  Of course this breaks down if my cluster
application needs to modify a file shared across the cluster.  To that, I
say, "Do 'man autofs' and see how it says 'The documentation leaves much to
be desired.'"  Then you will know what I mean.

Subject: Re: Beowulf howto
From: "Ruth A. Kramer" ####@####.####
Date: 13 May 2004 12:42:02 -0000
Message-Id: <40A2F9D0.7DB0@fast.net>

Kurt wrote:
> Earlier this year I submitted a proposed beowulf howto, and you expressed
> concerns about security.
> 
> I have revised it, so here it is:
> 
> Wolf Up by Kurt Swendson

I found this interesting to read (convenient being right there in the
email) and thought I'd offer a few comments / questions as if this were
a real maillist thread.  (Which, I guess it is.)

Aside: I know it was a different post and may have been on a different
mail list, but somebody recently posted another long email (I think it
was a HOWTO also, so maybe the tldp list).  They "assumed" that it would
not cause anybody any problems and asked if anybody still uses dialup
anyway.  I think many people still use dialup, and, dialup or not, in
some countries I understand that you pay to download (and upload) email
on a per (kilo/mega/whatever) byte basis.  So I would suggest that we be
considerate.  (BTW, I do not consider this email overly long (compared
to a bunch of others I get with HTML or large attached files or pictures
(spam)).)  (Parenthesis included at no extra charge.)

I found the HOWTO interesting first of all for some of the tidbits of
configuration suggestions / examples that I think would be useful
whether you are setting up a Beowulf cluster or not (how to generate an
ssh key, how to set up NFS, don't use an "illegal" password, etc.), so I
will maintain a link to the HOWTO for those tidbits.

It may be out of your intended scope for the document, but I didn't find
any guidance on why someone might want a Beowulf cluster.  Treating this
like an email thread, I'd like to discuss that a little bit here, even
if you choose not to add anything to the HOWTO.

I have several computers set up on my local home network (old and cheap)
and several more I could add.  Some are workstations for my son (Linux /
Windows) and wife (Windows) and one is a dos box serving as my Internet
gateway.  (I had a box running a Linux mailserver and a local TWiki (on
Apache), but the hard disk crashed and I haven't rekindled the
enthusiasm necessary to restore these.)  Then I have a workstation for
myself (700 MHz Duron) and a workstation with my CD burner on it (running
Windows :-( ).  I plan to add some sort of backup server (to collect
backups of all the machines, data and/or OS images, depending on my
perception of the difficulty of doing a reinstall).  (I deeply regret not
having a backup image of the mailserver.)

Then I occasionally set up another machine to try out a new distro
before upgrading my or my son's workstations.

Some questions:

(How) could a Beowulf cluster help me?

   * Would a Beowulf cluster automatically take tasks from my
workstation and run them elsewhere on the cluster to improve the speed,
or is the effect of a Beowulf cluster pretty much like if I ran some
applications on machines other than my workstation but interfaced with
them via the X server on my workstation?  (Is that called "remote X"?)

   * Some machines I would not try to incorporate in any cluster,
notably my wife and son's workstations and the Internet gateway (running
on DOS, which I'm hoping provides a little bit of diversity to my
network and is thus perhaps less easy to hack -- is anybody trying to
hack IPROUTE (Mischler) running on dos? (rhetorical question)).

   * Any other benefits to running a Beowulf cluster?

regards,
Randy Kramer






Subject: Re: Beowulf howto
From: "Ruth A. Kramer" ####@####.####
Date: 13 May 2004 12:59:36 -0000
Message-Id: <40A2FDF5.6472@fast.net>

Oops(es):

   * I meant to snip most of the original email just to cut down on
bandwidth (especially after raising it as a concern)  -- I'll do that
now before I further embarrass myself by forgetting again.

   * And, I've now read some of the referenced book on Beowulf
clustering and have answered most of my own questions.

_I think I'll go back to bed._ ;-)

Randy Kramer

Randy Kramer wrote:
> 
> Kurt wrote:
> > Earlier this year I submitted a proposed beowulf howto, and you expressed
> > concerns about security.

---< SNIP SNIP SNIP !!>---
Subject: Re: Beowulf howto
From: "s. keeling" ####@####.####
Date: 13 May 2004 15:29:30 -0000
Message-Id: <20040513152910.GB5067@infidel.spots.ab.ca>

Incoming from Ruth A. Kramer:
> Kurt wrote:
> > Earlier this year I submitted a proposed beowulf howto, and you expressed
> > concerns about security.
> 
> was a HOWTO also, so maybe the tldp list).  They "assumed" that it would
> not cause anybody any problems and asked if anybody still uses dialup

fwiw, lots of people do still use dialup (including me).  However,
considering estimates are that 50% - 80% of network traffic nowadays is
spam, I expect people must be getting used to wasting half their
investment in handling cruft.

There are large parts of the world that are still effectively without
net access; much of Africa remains un-wired.  Then there are politics,
dictatorial regimes, and plain censorship that continue to cut off
others.

> It may be out of your intended scope for the document, but I didn't find
> any guidance on why someone might want a Beowulf cluster.  Treating this

I think that would be a valid comment to make.  Every scientific paper
starts out with a paragraph or two summarizing the contents of the
rest of the paper.  HOWTOs should be no different.

> I have several computers set up on my local home network (old and cheap)

You may benefit from NIS/NFS, networked machines, and all the other
things a LAN can do for you.  You can be sitting at the smallest, most
anaemic box on the LAN, then start up some huge application on your
big Duron but direct its results/display back to the tiny box you're
sitting at.  That's the kind of thing X Window was designed for.

>    * Would a Beowulf cluster automatically take tasks from my
> workstation and run them elsewhere on the cluster to improve the speed,

That's plain old LAN + X Window.

>    * Any other benefits to running a Beowulf cluster?

The only benefit to running Beowulf is if you have _an application_
whose way of doing things lends itself to distributed processing.  If
it does large things which can be broken up into smaller, discrete
tasks, then brought together at the end of processing to produce the
result, that's Beowulf.  There's not really all that many applications
for this sort of thing.  Simulating weather patterns, engineering
modelling, materials analysis, and that sort of thing would lend
itself toward Beowulf.  On the other hand, running a Quake (game)
server would not.


-- 
Any technology distinguishable from magic is insufficiently advanced.
(*)               http://www.spots.ab.ca/~keeling 
- -
Subject: Re: Beowulf howto
From: Jesse Meyer ####@####.####
Date: 13 May 2004 15:34:04 -0000
Message-Id: <20040513153320.GA26108@pong.lan>

On Thu, 13 May 2004, Ruth A. Kramer wrote:
>                                             They "assumed" that it would
> not cause anybody any problems and asked if anybody still uses dialup
> anyway.  I think many people still use dialup, and, dialup or not, in
> some countries I understand that you pay to download (and upload) email
> on a per (kilo/mega/whatever) byte basis.  So I would suggest that we be
> considerate.  (BTW, I do not consider this email overly long (compared
> to a bunch of others I get with HTML or large attached files or pictures
> (spam)).)  (Parenthesis included at no extra charge.)

I'm still on dialup, and large emails can be a problem.

Jesse Meyer

-- 
  Want to listen to new music?
  Why don't you look at iRATE?                                   icq: 34583382
  http://irate.sourceforge.net/                      msn: ####@####.####
                                                  jabber: ####@####.####
