Trillian, a CRAY XE6m-200 at UNH for the simulation of space plasma and fluid flow
Trillian was acquired through an NSF Major Research Instrumentation (MRI) grant, with cost sharing from the UNH Space Science Center, Senior Vice Provost for Research, the College of Engineering and Physical Sciences (CEPS), and the Research and Instrumentation Center (RCI). Trillian was installed June 11/12, 2013 and began production about a week later. The Principal Investigator of the MRI grant is Jimmy Raeder (Physics/SSC), with Co-Investigators Ben Chandran (Physics/SSC), Greg Chini (ME), Kai Germaschewski (Physics/SSC), John Gibson (Math), Nathan Schwadron, and Bernie Vasquez (Physics/SSC).
- 132 nodes, each with 2 AMD 16 core 'Abu Dhabi' 2.4 GHz CPUs, for a total of 4,224 cores.
- 12 service nodes (boot/login/filesys) with 6-core 2.6 GHz CPUs.
- CRAY proprietary Gemini 3D torus interconnect fabric.
- CLE (Cray Linux Environment) software stack with GNU, Portland Group, and CRAY compiler/MPI suites.
- LUSTRE parallel file system with 160TB usable space.
- 4 19" cabinets (3 compute + 1 for disks).
- PBS batch system.
Trillian is administered by RCI personnel (Tom Baker and Mark Maciolek). To apply for a user account, go to new user.
Documentation is currently very rudimentary:
- Primary documentation can be found on the CRAY web site: Documentation
- use 'module list', 'module avail', 'module load xxxxx' to set your software environment. See modules.sourceforge
- GCC, CRAY, INTEL, and PGI compiler suites are available, with the typical libs (mpi, hdf5).
- the batch system is PBS (qsub, qstat, qdel and friends). 'qstat -a' shows what's running/waiting
- do NOT use the login nodes for computation other than compiling, etc.
- the default PBS queue is workq.
- sample PBS script:
  #! /bin/bash
  #PBS -l nodes=6:ppn=32
  #PBS -l walltime=99:99
  cd /home/space/jraeder/THM-DF-20120703-trillian-test01
  export GFORTRAN_UNBUFFERED_ALL=1
  aprun -n 178 ./THM-DF-20120703-trillian-test01.exe
- you need to start executables with 'aprun' even for serial codes, otherwise they run on the login node!
- there are currently no queue limits, but users should keep #tasks < 1000 for the time being.
- do NOT store big files in the home directory, use the lustre file system. Every user has a lustre directory /mnt/lustre/lus0/GROUP/USER. (Update: all home dirs are on Lustre now, so it does not matter.)
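The sample PBS script above can also be generated for an arbitrary task count. The following is a minimal sketch, not a site-approved template: the function names `nodes_needed` and `make_pbs_script` are hypothetical, the default walltime is a placeholder, and `cd $PBS_O_WORKDIR` (the directory from which qsub was invoked) is swapped in for the hard-coded path of the sample. It assumes 32 cores per node, as listed in the hardware summary.

```shell
#!/bin/bash
CORES_PER_NODE=32   # 2 x 16-core CPUs per node, per the hardware list

# Number of whole nodes needed for a given number of tasks (rounds up).
nodes_needed() {
    local ntasks="$1"
    echo $(( (ntasks + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Print a PBS script that runs NTASKS instances of EXE to stdout.
# Arguments: NTASKS EXE [WALLTIME]
# The default walltime is a placeholder, not a site policy.
make_pbs_script() {
    local ntasks="$1" exe="$2"
    local walltime="${3:-01:00:00}"
    local nodes
    nodes=$(nodes_needed "$ntasks")
    cat <<EOF
#!/bin/bash
#PBS -l nodes=${nodes}:ppn=${CORES_PER_NODE}
#PBS -l walltime=${walltime}
cd \$PBS_O_WORKDIR
aprun -n ${ntasks} ${exe}
EOF
}
```

For example, `make_pbs_script 178 ./mycode.exe > job.pbs` followed by `qsub job.pbs` would request 6 nodes, matching the sample above (`./mycode.exe` is a stand-in name).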
Here is a bunch of videos and a tgz archive with examples from Dave Strensky. Dave is also the Cray person to ask the really tough questions.
Running one-core jobs:
The smallest allocation unit on Trillian is one node, i.e., 32 cores. There is a good reason for that: if multiple jobs were allowed on one node, they could get in each other's way by competing for resources such as memory or I/O bandwidth.
If there is a need for running many identical one-core jobs in parallel, one should write a wrapper code that uses MPI calls to run multiple instances of the same code with different parameters. This is fairly straightforward; for examples, see https://www.nersc.gov/users/computational-systems/hopper/running-jobs/example-batch-scripts/
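A shell-only variant of this idea is sketched below. It is not the MPI wrapper described above: it replaces the MPI calls with the rank that aprun is assumed to export in the environment variable `ALPS_APP_PE` (verify this on Trillian before relying on it). The function names, the parameter file layout (one line of arguments per instance), and the `out.RANK.log` naming are all hypothetical.

```shell
#!/bin/bash
# Print the parameter line for a given rank (rank 0 -> line 1 of the file).
params_for_rank() {
    local rank="$1" file="$2"
    sed -n "$((rank + 1))p" "$file"
}

# Run one serial instance; meant to be started once per core by aprun.
# Arguments: EXE PARAMS_FILE
# ALPS_APP_PE is assumed to hold this instance's rank; falls back to 0.
run_instance() {
    local exe="$1" file="$2"
    local rank="${ALPS_APP_PE:-0}"
    local params
    params=$(params_for_rank "$rank" "$file")
    if [ -z "$params" ]; then
        echo "rank $rank: no parameters, nothing to do" >&2
        return 0
    fi
    # Word splitting of $params is intentional: one argument per word.
    $exe $params > "out.$rank.log" 2>&1
}
```

A wrapper script, say `wrapper.sh`, would contain these functions followed by a call such as `run_instance ./mycode params.txt`, and the PBS script would launch it with `aprun -n 32 ./wrapper.sh` so that one node runs 32 independent instances.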
File access via sshfs:
Simulations on Trillian will typically leave large amounts of data on the lustre file system. Since Trillian is not necessarily a good platform for analysis and visualization of these data, it is often necessary to access data from workstations. This can happen in two ways: (1) move the data using ftp, scp, or rsync (the latter is the most convenient and most efficient), or (2) provide remote access. The lustre file system on Trillian cannot be NFS mounted, but it is possible to create a mount point in "user space" for which no root access is required. To do so, make sure the 'sshfs' package is installed on your machine. Then, issue from the client:
  mkdir -p $HOME/trillian
  sshfs -o Ciphers=arcfour -o idmap=user YOUR_USER_NAME@trillian.sr.unh.edu:/mnt/lustre/lus0 $HOME/trillian
  df -h
You will see a new file system named 'trillian' on your machine. Of course, you can give it a different name. This way you can access the files on trillian from your client machine. Since the access is over the network it is naturally slower than direct disk access, but in many cases that is still better than having to move files around. Note that sshfs uses ssh and sftp to transfer bytes over the network, and that it encrypts everything. It does not seem possible to eliminate encryption, but arcfour is supposed to be very fast. In my tests (JR) speeds varied a lot in the range of 6-60 MB/s.
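Both access routes can be wrapped in small helpers. The sketch below only builds and prints the command lines, so they can be inspected before being run; the function names are hypothetical. The cipher is left optional because newer OpenSSH releases have dropped arcfour, in which case the `Ciphers` option has to be omitted or changed.

```shell
#!/bin/bash
TRILLIAN_HOST="trillian.sr.unh.edu"
TRILLIAN_ROOT="/mnt/lustre/lus0"

# Print the sshfs command that mounts Trillian's Lustre tree.
# Arguments: USER [MOUNTPOINT] [CIPHER]
sshfs_command() {
    local user="$1"
    local mountpoint="${2:-$HOME/trillian}"
    local cipher="${3:-}"            # empty: let ssh negotiate the cipher
    local cmd="sshfs -o idmap=user"
    if [ -n "$cipher" ]; then
        cmd="$cmd -o Ciphers=$cipher"
    fi
    printf '%s %s@%s:%s %s\n' "$cmd" "$user" "$TRILLIAN_HOST" "$TRILLIAN_ROOT" "$mountpoint"
}

# Print an rsync command that copies a remote path to a local directory.
# Arguments: USER REMOTE_PATH LOCAL_DEST
rsync_command() {
    local user="$1" remote_path="$2" dest="$3"
    printf 'rsync -av --partial %s@%s:%s %s\n' "$user" "$TRILLIAN_HOST" "$remote_path" "$dest"
}
```

To actually mount, create the mount point first (`mkdir -p $HOME/trillian`) and run the printed command. On Linux clients a user-space mount is normally released with `fusermount -u $HOME/trillian`.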
firstname.lastname@example.org [to subscribe, send an email to email@example.com]
- RCI operators: 862-4518
- email Tom Baker
- email Mark Maciolek
- email Jimmy Raeder
- email Kai Germaschewski
Narrative for proposals:
UNH operates ``Trillian'', a CRAY XE6m-200 supercomputer with 4,096 compute cores, 4 TB of memory, GEMINI interconnect fabric, and 160 TB Lustre disk space. Trillian has a top speed of about 40 TeraFlops. Trillian runs the Cray Linux Environment (CLE) and its software stack consists of the GNU, Portland Group, and CRAY compilers along with the typical libraries such as MPI and HDF. Trillian batch jobs are managed using PBS. Trillian will be available for this project.
Acknowledgement for Trillian use in papers:
Computations were performed on Trillian, a Cray XE6m-200 supercomputer at UNH supported by the NSF MRI program under grant PHY-1229408.