How do I start up a large number of nodes quickly?
From: "Martijn de Vries" <email@example.com>
Subject: [Bright Cluster Manager Support #6472] scaling question
Date: Wed, 21 May 2014 01:55:43 +0200
<URL: http://support.brightcomputing.com/rt/Ticket/Display.html?id=64725 >
On Wed May 21 00:38:19 2014, firstname.lastname@example.org wrote:
> While we haven't requested a demo license for this yet, we are
> interested in seeing how this might scale on a system with 288 nodes to get us
> ideas at least of how something more like 10,000 nodes might function.
Ok, let us know when you want to try it. If you have 288 physical nodes
to play with, you could also try to simulate a larger cluster by
virtualizing the nodes. We did this ourselves in the past a couple of
times and went up to about 8000.
> My reading of the bright guide and some playing I've done show that you
> basically designate some hosts as provisioning nodes. It appears the
> sync method is rsync. The manual suggests only 10 nodes be installed
Actually it's a little more subtle. You can boot any number of nodes
simultaneously, but the 'slots' parameter in the provisioning role
determines how many nodes are provisioned concurrently from a particular
provisioning node. The remaining nodes will wait until a provisioning
slot becomes available on one of the provisioning nodes.
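As a rough illustration of this queueing behaviour (not taken from the ticket itself), the sketch below models nodes waiting for a free slot. The function name and all parameters are hypothetical, and the model assumes every node takes the same fixed time to provision regardless of how many others share the provisioning node, which real hardware will not honour once slots are oversubscribed.

```python
import heapq

def provisioning_makespan(num_nodes, num_provisioners, slots, seconds_per_node):
    """Time (in seconds) until the last node has finished provisioning.

    Assumptions (illustrative only): all nodes are powered on at t=0,
    each node takes a fixed `seconds_per_node`, and a waiting node takes
    the first slot that becomes free on any provisioning node.
    """
    total_slots = num_provisioners * slots
    # Min-heap holding the time at which each slot next becomes free.
    free_at = [0.0] * total_slots
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_nodes):
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + seconds_per_node
        finish = max(finish, end)
        heapq.heappush(free_at, end)     # slot is busy until `end`
    return finish

if __name__ == "__main__":
    # 288 nodes, 3 provisioning nodes with 10 slots each, 60 s per node:
    # 30 nodes run at a time, so the nodes are served in 10 waves.
    print(provisioning_makespan(288, 3, 10, 60))
```

Because the per-node time is held constant, this model always favours more slots; it only shows how the waiting works, not where the optimum lies.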
You can set the number of slots on a provisioning node as high as you
like. For example, if you set it to 10000, then all of your 10000 nodes
will be provisioned simultaneously (if you manage to power all of them
on at the same time). As you can probably imagine, this will not
give you the best performance as there will be lots of context switching
going on. 10 is a pretty decent number for provisioning over 1 GigE, but
if you have InfiniBand or 10/40 GigE you can probably set it a little
higher, for example 50 or 100, although it may not make much of a
difference.
> Is there a system design where we could get let's say 288 nodes fully
> and booted in 10 minutes or so? What settings are used for an approach
> like this? What sort of network topology is suggested?
Yes, that's possible. We have customers that do full provisioning of
512 nodes in 10 minutes. They PXE boot and provision over IB, and I
believe they have 1 or 2 extra provisioning nodes on top of the two head
nodes (configured as an HA pair).
Factors that matter:
1) Interconnect that is used for provisioning (e.g. GigE, 10 GigE, IB)
2) Number of provisioning nodes
3) Disk throughput (on your provisioning nodes as well as on your
compute nodes)
And if you really want to optimize provisioning, you can play around
with the number of provisioning slots per provisioning node.
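To get a feel for how these factors interact, here is a back-of-the-envelope estimate. It is only a sketch: the function and parameter names are hypothetical, the image size and throughput figures in the example are made-up inputs rather than measurements, and it ignores protocol overhead, compute-node disk speed and boot time. It treats each provisioning node as limited by the slower of its network link and its disk.

```python
def full_provisioning_minutes(num_nodes, image_gb, num_provisioners,
                              net_gbit_s, disk_mb_s):
    """Lower-bound estimate of full provisioning time, in minutes.

    Assumptions (illustrative only): every node receives the whole image,
    the load is spread evenly over the provisioning nodes, and each
    provisioning node is capped by min(network, disk) throughput.
    """
    net_mb_s = net_gbit_s * 1000 / 8           # Gbit/s -> MB/s
    per_provisioner = min(net_mb_s, disk_mb_s)  # the bottleneck wins
    total_mb = num_nodes * image_gb * 1000      # data to push out
    seconds = total_mb / (per_provisioner * num_provisioners)
    return seconds / 60

if __name__ == "__main__":
    # Hypothetical inputs: 288 nodes, 2 GB image, 4 provisioning nodes,
    # disks that sustain 500 MB/s.
    print(full_provisioning_minutes(288, 2.0, 4, 1, 500))   # 1 GigE
    print(full_provisioning_minutes(288, 2.0, 4, 10, 500))  # 10 GigE
```

With these made-up numbers the network is the bottleneck on 1 GigE, while on 10 GigE the disks become the limit, which is why adding provisioning nodes or faster disks can matter as much as the interconnect.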
Note that in order to provision over IB, you don't necessarily need to
PXE boot over IB. You can PXE boot over GigE and then use IB just for
provisioning.
Also keep in mind that "full" provisioning (where the entire software
image is provisioned to the node from scratch) is normally only done
when you are setting up the cluster, or if you are switching nodes to a
different software image. On subsequent boots "sync" provisioning is
used, which typically completes within a couple of seconds per node
(provided that there have not been major changes to the software image).
We did some experiments ourselves a little over a year ago on a
(non-virtualized) 4000 node cluster with IB in various configurations:
1) full provisioning to local disk on the compute nodes
2) full provisioning to tmpfs
3) root over Lustre
4) root over NFS
Going from complete power-off to everything powered on (so including
BIOS POST, PXE, etc.) took between ~20m and ~40m depending on the
configuration. I can probably find the exact numbers if I dig in my
mailbox (let me know).
> Maybe the best thing would be if you have a white paper or something that
> describes the setup/tuning for a larger deployment.
> If you have such a document, could you please pass it along?
We don't have such a document, but it sounds like an interesting topic.
Unfortunately, being a software company, we don't have large numbers of
nodes sitting around, and we also don't have the datacenter
infrastructure to run them. We'd be happy to work with you on your 288
node cluster and show you how to tune provisioning performance. We can
publish the results in a whitepaper so that it's available to anyone who
is interested.
> At this stage, it could be that the document is enough for what I need
> rather than experimenting.
> Thank you!