Welcome to the Colonial One wiki

For help, please contact your local support partner, or email hpchelp@gwu.edu.

A few outside links:

Colonial One related presentations are kept in a shared presentations folder.

This site is under construction, and new content will be added continually as we work to get additional users and applications online.

Colonial One Calendar


Getting Access

To request access, please email hpchelp@gwu.edu from your GW email account, letting us know what research group you're in, and (for students) CC your advisor/PI.

Groups that have not directly contributed to the system have limited access to spare cycles. If your group is interested in contributing to expanding the system, please get in touch at hpchelp@gwu.edu and we'll be happy to discuss it with you.

External researchers collaborating with GW research groups will need a GW faculty member to request an affiliate account for them.

NOTE: all users of Colonial One must agree to be subscribed to the Colonial One listserv. The listserv is the official means of communication with our community of users, and all important notices are sent through it.

How to Acknowledge Colonial One

Have you used Colonial One or the Capital Area Advanced Research and Education Network (CAAREN) for an award, publication or project? If so, let us know! Please take the time to fill out the acknowledgement form to help us document and credit the use of these systems. Your responses will help us complete our annual reporting requirements and will help to showcase the value of these systems to GW.

New User Quick-Start

The HardwareSpecifications page documents the current cluster hardware.

Connecting to Colonial One

The login nodes, which should be used only for submitting jobs, file transfers, software compilation, and simulation preparation, are accessible as login.colonialone.gwu.edu. The two current login nodes can also be accessed directly as login3.colonialone.gwu.edu or login4.colonialone.gwu.edu.

You can use SSH or SCP/SFTP to connect. Globus is also available for file transfers using the gw#colonialone endpoint. An overview of registering for and using Globus to connect to Colonial One can be found here: GlobusSetup.

The SSH key fingerprint for all of the login nodes is 77:39:f1:fb:e4:6d:f4:38:bb:c6:ba:08:0e:b4:b8:e3. Connections to the login nodes are authenticated using your GW NetID and password.
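
For example, from a terminal on Linux or macOS you might connect and copy a file like this (replace netid with your own GW NetID; the file and directory names are only illustrative):

# log in interactively (authenticates with your GW NetID and password)
ssh netid@login.colonialone.gwu.edu

# copy a local file to your home directory on the cluster
scp input.dat netid@login.colonialone.gwu.edu:/home/netid/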

Connecting to Colonial One (Windows Users)

We recommend PuTTY, together with an SCP/SFTP client, for users connecting to Colonial One from Microsoft Windows. See the video tutorial below for a walk-through of installation and set-up.

New: another supported SSH client, with integrated X11 tunneling support, is also available for connecting to Colonial One.

Available filesystems

There are two main filesystems on Colonial One. The first is connected over NFS and used to store /home and /groups, and has 250TB of usable space. The second is a high-speed Lustre scratch filesystem, accessed as /lustre. There is another filesystem, /import, which can be purchased for archival storage.

By default, you have access to three locations (with a fourth archival option available for purchase):
  • /home/$netid/ - your home directory, with a default quota of 25GB. NOTE: no jobs should be run against the /home partition; it is not designed for performance. Use /lustre instead.
  • /groups/$group/ - shared group space, accessible by anyone in your group. (If you are unsure of what group you are in, the groups command will tell you.) Default quota of 250GB. NOTE: no jobs should be run against the /groups partition; it is not designed for performance. Use /lustre instead.
  • /lustre/groups/$group/ - shared group scratch space; run your jobs here (see the staging example after this list).
  • /import/$group/ - archival storage that can be purchased. NOTE: no jobs should be run against the /import partition; it is for archival purposes only and is not designed for performance. Use /lustre instead.
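
Because /home and /groups are not designed for job I/O, a common pattern is to stage data into your group's Lustre scratch space before running jobs, and copy results back afterwards. A minimal sketch, with $group standing in for your own group name and the project paths purely illustrative:

# stage input data from group storage onto Lustre scratch
mkdir -p /lustre/groups/$group/$USER/myproject
cp -r /groups/$group/myproject/inputs /lustre/groups/$group/$USER/myproject/

# ... run your jobs against the Lustre copy ...

# copy results back to group storage when the jobs finish
cp -r /lustre/groups/$group/$USER/myproject/results /groups/$group/myproject/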

Quotas are in place on /home and /groups spaces, and the current disk usage for your home directory and that of all the groups you belong to is displayed when you login. You can see the current status at any time using the quotareport tool. Please note that the statistics are delayed by up to five minutes. If you need additional space you must email hpchelp@gwu.edu with (1) an explanation of your space requirements, and (2) a short description of how your data is being transferred and maintained on permanent storage elsewhere.

Space on the Lustre filesystem is subject to removal under the conditions set in the Lustre scratch purge policy. This is to ensure sufficient space remains available for ongoing work on the system.

Warning: Backups are your responsibility. The /home and /groups directories are backed up on a monthly basis as a disaster-recovery mechanism only. The /lustre scratch space is never backed up. We cannot recover accidentally deleted files from any filesystem on Colonial One. We recommend keeping your source code in an external version-control repository.
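
As a minimal sketch of taking your own backups (the remote host, repository, and paths here are placeholders, not GW services): keep source code under version control and copy important results to a machine you control.

# push your source code to an external version-control host
git push origin main

# copy results off the cluster to your own storage
rsync -av /groups/$group/myproject/results/ you@your-backup-host:/backups/myproject/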

File transfer

Globus is recommended for large-scale data transfers, both between your own systems (laptop, desktop) and the cluster, and between the cluster and other external HPC centers such as XSEDE.

Colonial One's public endpoint is gw#colonialone.

An overview of signing up and using Globus can be accessed here: GlobusSetup
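
If you prefer the command line over the Globus web interface, the separately installed Globus CLI can drive the same transfers. A hedged sketch (the endpoint UUIDs and paths are placeholders you would look up yourself after authenticating with globus login):

# find the UUID of the Colonial One endpoint
globus endpoint search "gw#colonialone"

# request an asynchronous transfer between two endpoints
globus transfer SRC_ENDPOINT_UUID:/home/netid/data.tar DST_ENDPOINT_UUID:/data/data.tar --label "colonialone copy"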

Using Modules to manage your environment

Modules are used to manage different software environments on the cluster. module avail will list the available modules on the cluster, and they can be added to your environment with module load $modulename.

As an example, to compile an OpenMPI program:

module load openmpi
mpicc -o hello hello_world.c

Modules can also be included in your job scripts to configure the job environment.
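
Beyond module avail and module load, the other standard module subcommands (these are generic Environment Modules commands, not specific to Colonial One) are useful for inspecting and resetting your environment:

module list             # show modules currently loaded in your environment
module show openmpi     # display what a module changes in your environment
module unload openmpi   # remove a single module
module purge            # remove all loaded modules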

Submitting jobs on the cluster

The Slurm workload scheduler is used to manage the compute nodes on the cluster. Jobs must be submitted through the scheduler to have access to compute resources on the system.

There are two groups of partitions (aka "queues") currently configured on the system. One group is specifically designated for CPU jobs, i.e., non-GPU jobs:
  • defq - default compute nodes, CPU only, with either 64, 128, or 256GB of memory
  • 128gb, 256gb - explicitly request compute nodes with 128GB or 256GB of memory, respectively, for larger-memory jobs
  • debug - please see DebugPartition for more details
  • short - has access to 128GB Ivy Bridge nodes with a shorter (currently 2-day) time limit, designed for quicker turnaround of shorter-running jobs. Some nodes here overlap with defq.
  • 2tb - a special-purpose machine with 2TB of RAM and 48 3GHz CPU cores. Access to this partition is restricted; please email hpchelp@gwu.edu if you have applications appropriate for this unique system.

The other group is reserved for jobs that require GPU resources and should not be used for jobs that do not need GPUs:
  • gpu - has access to the GPU nodes, each has two NVIDIA K20 GPUs
  • gpu-noecc - has access to the same GPU nodes, but disables error-correction on the GPU memory before the job runs
  • ivygpu-noecc - has the same NVIDIA K20 GPUs, but with newer Ivy Bridge Xeon processors
  • allgpu-noecc - combines both ivygpu-noecc and gpu-noecc

Note that you must set a time limit (with the -t flag; for example, -t 1:00:00 sets a limit of one hour) when submitting your jobs, otherwise they will be rejected immediately. This allows the Slurm scheduler to keep the system busy by backfilling while it allocates resources for larger jobs. The maximum time limit for any job is 14 days, but you are encouraged to keep jobs to a day or less - longer-running processes should checkpoint and restart so that a hardware or cluster-configuration problem does not cost you a significant amount of outstanding work. We will not under any circumstances increase the time limit of a job that has already started, so please estimate your required time carefully and request a limit accordingly; we will provide guidance if you are unsure of your job's requirements. A single job may not exceed 224 node*days, which prevents any single job from allocating more than 16 nodes for 14 days.
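
The same options can also be given directly on the sbatch command line, overriding anything set inside the script; for example (the script name is illustrative):

# one hour on the default CPU partition, 16 tasks
sbatch -p defq -t 1:00:00 -n 16 my_job.sh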

As an example, here is a simple MPI job script, which could be submitted with sbatch job_script.sh:

#!/bin/sh
# one hour timelimit:
#SBATCH --time 1:00:00
# default queue, 32 processors (two nodes worth)
#SBATCH -p defq -n 32

module load openmpi

mpirun ./test

The Slurm documentation has further details on some of the advanced features, and the Slurm Quick-Start User Guide provides a good overview. The use of job arrays (Job Array Support) is mandatory for anyone submitting a large quantity of similar jobs.
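
As a minimal sketch of a job array (the program and input file names are placeholders), the script below runs ten similar tasks as a single array job; each task reads a different input file selected by its array index:

#!/bin/bash
# ten array tasks, indices 1 through 10
#SBATCH --array=1-10
#SBATCH -p defq
#SBATCH -N 1
#SBATCH -t 1:00:00
# %A is the array job id, %a the task index
#SBATCH -o array_%A_%a.out

# each task processes its own input file, e.g. input_1.dat ... input_10.dat
./my_program input_${SLURM_ARRAY_TASK_ID}.dat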

Matlab Example

#!/bin/bash

# set output and error output filenames, %j will be replaced by Slurm with the jobid
#SBATCH -o testing%j.out
#SBATCH -e testing%j.err 

# single node in the "short" partition
#SBATCH -N 1
#SBATCH -p short

# half hour timelimit
#SBATCH -t 0:30:00

module load matlab
# tunnel the MATLAB license-server ports through login4 so MATLAB on the compute node can check out a license
ssh login4 -L 27000:128.164.84.113:27000 -L 27001:128.164.84.113:27001 -N &
# point MATLAB at the forwarded local license port
export LM_LICENSE_FILE="27000@localhost"

# test.m is your matlab code
matlab -nodesktop < test.m
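
As with the MPI example above, save this as a script (for example matlab_job.sh; the name is up to you) and submit it with sbatch matlab_job.sh.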