Please review the policies for the system. Doing so will help you find the right level of service for your project and choose job parameters correctly. We want to provide the best quality of service to all users, so please consider how we could improve the service you receive and send us your feedback (see the contact details on the HPC web site).
The Research Computing Service aims to help deliver cutting edge research using innovative technology within the broad area of data-centric HPC. It is responsible for the hosting, system support, scientific support and service delivery of several large supercomputing and research data storage resources for the University of Cambridge research community.
The University-wide service is run as a self-sustaining cost centre and therefore must recover all costs incurred by the capital depreciation and running costs of the computer equipment, plus additional scientific support costs incurred to help increase the useful scientific output achieved from the machine. To this end the computational equipment within the Service is run as a Major Research Facility under the Full Economic Costing (fEC) funding model. Under this model, units of use are priced and research staff should determine how many units they require for a particular project, explicitly include these costs within a grant application for the project, and then pass this funding back to the HPC Service as a direct cost.
As a result of this funding requirement, the HPC Service must have a clearly stated and controlled usage policy which results in well-defined and guaranteed service level agreements (SLAs). This is achieved by use of the SLURM resource allocation software and the implementation of a detailed resource allocation policy.
We request that you acknowledge use of the CSD3 Tier 2 system in all publications and presentations which use any results generated through your use of the CSD3 clusters and that you send us copies of publications (or provide links) on request. The following acknowledgement can be used in papers:
This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).
Internal University of Cambridge users should refer to the charges page. External users should contact us directly.
All costs are covered including capital depreciation of the machine, all running costs including power, management staff and support staff. The aim is for the Research Computing Service to develop a core in-house knowledge base in HPC systems and scientific support which will be available to all research staff within the University.
The service allows a range of service levels with different features. These are associated with different quality of service definitions within the scheduler, enforcing different job priorities, maximum job sizes and maximum run times.
A service level is attached to a project, which is a group of users led by a principal investigator (PI) who controls (or who can apply for) a line of funding for HPC.
The current Service Levels (SLs), which came into operation in November 2017 and which apply to the Peta4 and Wilkes2 clusters, are described below.
Funding units are in the form of usage credits. What these credits represent depends on the type of cluster. For the CPU cluster (Peta4-Skylake) 1 credit = 1 CPU core for 1 hour. For the KNL cluster (Peta4-KNL) 1 credit = 1 KNL node for 1 hour. For the GPU cluster (Wilkes2-GPU) 1 credit = 1 GPU for 1 hour.
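The credit definitions above amount to a simple product of resource units and hours. The following is a minimal sketch only: the function name `credits_used` is hypothetical, and the real accounting is performed by the scheduler, which may round or measure time differently.

```python
# Resource unit that one credit buys for one hour on each cluster,
# as stated in the text above.
CREDIT_UNIT = {
    "Peta4-Skylake": "CPU core",
    "Peta4-KNL": "KNL node",
    "Wilkes2-GPU": "GPU",
}


def credits_used(cluster: str, units: int, hours: float) -> float:
    """Estimate credits consumed by a job holding `units` resource units for `hours`.

    Hypothetical helper for illustration; not part of any CSD3 tool.
    """
    if cluster not in CREDIT_UNIT:
        raise ValueError(f"unknown cluster: {cluster}")
    return units * hours


if __name__ == "__main__":
    # 64 Skylake CPU cores held for 12 hours.
    print(credits_used("Peta4-Skylake", 64, 12))
```

For example, under these assumptions a job using 4 GPUs on Wilkes2-GPU for 2.5 hours would consume 10 credits.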
Accounting periods, or quarters, are three-month periods of the year running 1st February - 30th April, 1st May - 31st July, 1st August - 31st October and 1st November - 31st January.
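Because the quarters are offset from the calendar year, working out which quarter a given date belongs to takes a little care, especially for January. The sketch below shows one way to compute it; the function name `accounting_quarter` is an assumption made for illustration.

```python
from datetime import date, timedelta
from typing import Tuple

# Months in which an accounting quarter begins: February, May, August, November.
QUARTER_START_MONTHS = (2, 5, 8, 11)


def accounting_quarter(day: date) -> Tuple[date, date]:
    """Return the (first day, last day) of the accounting quarter containing `day`."""
    if day.month == 1:
        # January belongs to the quarter that began the previous November.
        start = date(day.year - 1, 11, 1)
    else:
        start_month = max(m for m in QUARTER_START_MONTHS if m <= day.month)
        start = date(day.year, start_month, 1)
    # The next quarter starts three months later; this one ends the day before.
    next_month = start.month + 3
    next_year = start.year + (1 if next_month > 12 else 0)
    next_start = date(next_year, (next_month - 1) % 12 + 1, 1)
    return start, next_start - timedelta(days=1)


if __name__ == "__main__":
    print(accounting_quarter(date(2018, 1, 15)))
```

A date in January therefore maps to the quarter that started on 1st November of the previous year.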
Service Level 1 – Guaranteed Usage
Service Level 1 (SL1) operates with the highest quality of service QOS1, designed for groups which require large and steady amounts of computer time over a long period.
Funds paid will be converted into core hour credits. These credits will be divided over an agreed time period and allocated to quarterly (three month) accounting periods; thus the number of core hours used per quarter is defined at the start of the usage agreement.
Furthermore, the quarterly allocation of core hours is set as a minimum usage guarantee. Thus SL1 users running a consistent workload throughout a quarter are guaranteed to be able to use their quarterly allocation.
SL1 users are able to use more than their allotted allocation within a quarter on a best efforts basis, by either using old credits or transferring credits from a future allocation quarter.
Old credits from previous quarters are (usually) made available automatically to a project once it has exhausted its credits within the current quarter, in the manner of credits assigned under SL2 (i.e. no guaranteed rate of usage).
The transfer of credits from a future quarter should be arranged directly with the support personnel.
Once both normal credits and expired credits have been exhausted, further jobs submitted can be handled under the terms of SL4 (Residual Usage). Note that SL4 is the lowest service level on the system, and SL4 jobs will only run when there are no other eligible jobs in the queue, i.e. when the system is not fully occupied with other jobs. SL4 is designed to help keep the system fully occupied at times of low usage.
Service Level 2 – Ad Hoc Usage
Service Level 2 (SL2) is the same as SL1 except that there is no preallocation of credits into specific quarters, and no predefined minimum quarterly usage level. Instead credits are created at the same cost and are available for use until exhausted. This service level has the highest quality of service (QOS1), and is designed for groups which require smaller and irregular amounts of computer time.
When SL2 users exhaust their credits they can continue to submit jobs at Service Level 3 until more credits are purchased.
For current costs and details of how to purchase, internal University of Cambridge users please refer to the charges page.
Service Level 3 – Free Usage
Service Level 3 (SL3) operates with the medium quality of service, QOS2, which is lower than the QOS1 used in SL1 and SL2. This service level is designed for groups with medium usage requirements who do not currently have funding to pay for their usage, so no immediate conversion of funds into credits is required.
SL3 is capped with a maximum number of usage credits per PI per quarter. This has been introduced to promote a more even usage of the free time on the system. Currently each PI may receive 200,000 CPU core hours, 8,000 GPU hours and 1,000 KNL node hours per quarter.
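As a rough illustration of the quarterly cap, the sketch below computes how much SL3 allowance a PI would have left on a given cluster. The names `SL3_QUARTERLY_CAP` and `sl3_remaining` are hypothetical, the mapping of each cap to a cluster name is inferred from the credit definitions for each cluster, and the cap values are those quoted above, which may change.

```python
# Per-PI, per-quarter SL3 caps quoted in the text above.
SL3_QUARTERLY_CAP = {
    "Peta4-Skylake": 200_000,  # CPU core hours
    "Wilkes2-GPU": 8_000,      # GPU hours
    "Peta4-KNL": 1_000,        # KNL node hours
}


def sl3_remaining(cluster: str, used_this_quarter: float) -> float:
    """Return the SL3 credits still available to a PI this quarter (never negative)."""
    cap = SL3_QUARTERLY_CAP[cluster]
    return max(cap - used_this_quarter, 0)


if __name__ == "__main__":
    # A PI who has already used 150,000 CPU core hours this quarter.
    print(sl3_remaining("Peta4-Skylake", 150_000))
```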
There is no guaranteed minimum usage level for SL3 and there is no concept of expiry or of moving core hours across quarters.
Once a group in SL3 has consumed all its allowed core hours in a quarter, jobs can still be submitted at Service Level 4.
Service Level 4 – Residual Usage
Service Level 4 (SL4) operates with QOS3, the lowest quality of service in operation on the cluster. This is the default service level for users who are not eligible for higher service levels.
SL4 jobs only run when there are no other eligible jobs in the queue. Users relying on SL4 to run a job should expect long wait times.
SL4 is designed to make use of otherwise idle compute cycles by allowing SL1, SL2 and SL3 users who have reached their limits to continue running jobs.
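The way a project falls back from one service level to the next as its credits run out can be summarised in a short decision function. This is a sketch of the rules described in the sections above, not the scheduler's actual logic: the function name and all of its arguments are hypothetical, and it assumes that the SL3 quarterly cap also governs SL2 projects that have dropped to SL3.

```python
# Quality of service attached to each service level, as described above.
QOS_FOR_SERVICE_LEVEL = {"SL1": "QOS1", "SL2": "QOS1", "SL3": "QOS2", "SL4": "QOS3"}


def effective_service_level(project_sl, current_credits=0.0,
                            old_credits=0.0, sl3_allowance_left=0.0):
    """Return the service level a newly submitted job would fall under.

    All arguments are hypothetical inputs for this sketch; the real
    decision is made by the scheduler and its accounting records.
    """
    if project_sl == "SL1":
        if current_credits > 0:
            return "SL1"  # within the guaranteed quarterly allocation
        if old_credits > 0:
            return "SL2"  # old credits are handled in the manner of SL2
        return "SL4"
    if project_sl == "SL2":
        if current_credits > 0:
            return "SL2"
        # Assumption: the SL3 quarterly cap applies once SL2 credits run out.
        return "SL3" if sl3_allowance_left > 0 else "SL4"
    if project_sl == "SL3":
        return "SL3" if sl3_allowance_left > 0 else "SL4"
    return "SL4"  # default for users not eligible for a higher level


if __name__ == "__main__":
    level = effective_service_level("SL2", current_credits=0, sl3_allowance_left=500)
    print(level, QOS_FOR_SERVICE_LEVEL[level])
```

Under these assumptions an SL2 project with no purchased credits left, but with SL3 allowance remaining, would run at SL3 with QOS2.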
Quality of Service
QOS1 – highest quality of service
- QOS1 jobs have the highest priority and will move through the queue fastest.
- QOS1 jobs have a maximum job run time of 36 hours.
- QOS1 jobs have a maximum per-job limit on the number of resource units that a single job can take:
  - Peta4-Skylake jobs have a 1280 CPU core limit
  - Peta4-KNL jobs have a 128 node limit
  - Wilkes2-GPU jobs have a 64 GPU limit.
The same limits also apply, on each cluster separately, to the combined total of QOS1 jobs belonging to the same user and project.
QOS2 – medium quality of service
- QOS2 jobs have a lower priority than QOS1 and will move through the queue more slowly than QOS1 jobs.
- QOS2 jobs have a maximum job run time of 12 hours.
- QOS2 jobs have a maximum per-job limit on the number of resource units that a single job can take:
  - Peta4-Skylake jobs have a 320 CPU core limit
  - Peta4-KNL jobs have a 64 node limit
  - Wilkes2-GPU jobs have a 32 GPU limit.
The same limits also apply, on each cluster separately, to the combined total of QOS2 jobs belonging to the same user and project.
QOS3 – lowest quality of service
- QOS3 jobs have the lowest priority and jobs will only run when there are no eligible QOS1 and QOS2 jobs in the queue.
- QOS3 jobs have a maximum job run time of 12 hours.
There is a global limit, on each cluster separately, on the total number of resource units that may be represented by running jobs at QOS3 (across all users and projects).
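The per-job limits listed above can be collected into a small lookup table and used to sanity-check a job request before submission. The sketch below is illustrative only: `check_job` is a hypothetical helper, the limits are copied from this page and may change, and it checks neither the per-user-and-project totals nor the global QOS3 limit, whose value is not stated here.

```python
from typing import List

# Maximum run time in hours for each quality of service, from the lists above.
MAX_HOURS = {"QOS1": 36, "QOS2": 12, "QOS3": 12}

# Maximum resource units per job (CPU cores, KNL nodes or GPUs).
# No per-job unit limit is listed for QOS3, so it has no entry here.
MAX_UNITS = {
    "QOS1": {"Peta4-Skylake": 1280, "Peta4-KNL": 128, "Wilkes2-GPU": 64},
    "QOS2": {"Peta4-Skylake": 320, "Peta4-KNL": 64, "Wilkes2-GPU": 32},
}


def check_job(qos: str, cluster: str, units: int, hours: float) -> List[str]:
    """Return a list of limit violations for a job request (empty if none found)."""
    problems = []
    if hours > MAX_HOURS[qos]:
        problems.append(f"run time {hours}h exceeds the {MAX_HOURS[qos]}h limit for {qos}")
    limit = MAX_UNITS.get(qos, {}).get(cluster)
    if limit is not None and units > limit:
        problems.append(f"{units} units exceeds the {limit}-unit limit for {qos} on {cluster}")
    return problems


if __name__ == "__main__":
    for problem in check_job("QOS2", "Wilkes2-GPU", units=48, hours=24):
        print(problem)
```

In this example both the run time and the GPU count exceed the QOS2 limits, so two problems are reported.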
Research Computing Services are specific research-oriented IT services provided by the University and as such are covered by the University IT Facilities and Services Privacy Notice. The basis for the use of personal data is consent, explicitly given at the time of user account application and implicitly upon each connection to Research Computing Services as per the banner warning if present.
This local policy document explains in more detail what information is held about individual people (Research Computing Service account holders) by Research Computing Service systems, how it is gathered and how it is used. Details of the data held or logged are given below. This information is used to support user access to the resources of the Research Computing Service systems, to enable communication with you about the status of the system and your use of it as required, for system administration and bug tracking, for the detection of improper use, and for producing usage statistics for management and planning purposes.
Access to these logs and to user-specific data is restricted to appropriate staff or contractors of the Research Computing Service, and in the specific case of DiRAC and Tier2 users, to the appropriate staff at EPCC responsible for resource allocation and user administration of DiRAC and Tier2 service through the SAFE system. Please note that SAFE is not part of the University of Cambridge and all DiRAC and Tier2 users should refer to the EPCC Privacy Statement.
These logs are currently held indefinitely, subject to the availability of storage space, but might not be recoverable following an accidental or deliberate removal action.
Summary statistics are extracted from this data. Some of these may be made publicly available, but those that are do not include the identity of individuals. DiRAC and Tier2 users (only) should note that their individual job records are uploaded to SAFE nightly.
Relevant subsets of this data may be passed to computer security teams (e.g. Cambridge CERT) as part of investigations of specific incidents of computer misuse involving Research Computing Service systems.
In the event that suspicious activity is detected on the CUDN, data held as described in the University IT Facilities and Services Privacy Notice may be passed to Research Computing Service management for investigation.
Data pertaining to particular projects may also on occasion be passed to the appropriate people (e.g. Principal Investigators or nominated deputies) responsible for direction and management of those projects. Otherwise the information is not passed to any third party except where required by law.
Data is stored on disk storage systems and may be backed up to tape at some frequency depending on the filesystem. These backups are made to enable reinstatement of the data, e.g. in the event of failure of a system component, or accidental deletion. Details of backup and other policies applicable per filesystem are available on the filesystem page. User data, log data and backups are at all times physically held in secure University premises, or transferred over the CUDN using strong SSH-based encryption.
Any user of the Research Computing Service systems who approaches the Service Desk or any staff within the Research Computing Service for help with a problem, implicitly grants permission to the Research Computing Service staff to investigate that problem by looking at data held on the system and files in their home directories or other personal or group storage areas.
Accounting and other user-dependent system data
The Research Computing Service management servers hold details of user accounts, thereby enabling a user to log in and use the resources of the Research Computing Service systems.
The following data are collected via either the account application process or service usage and held and maintained for each user:
- User identifier (account name)
- Institution affiliation
- Project affiliation
- Email address
- Contact telephone number
- User administration history
- Login history (session begin/end times and originating IP address)
- Resource consumption (in the form of job records accumulated by the job scheduler)
- Use of licensed applications (in the course of ensuring license term compliance).
These data are held on the Research Computing Service management systems from the time the user’s account is created, whether or not the user ever makes use of the Research Computing Service systems.
Service specific data remain stored subject to storage capacity until purged as obsolete; basic user information (names, system identifiers and institutional affiliations) regarding University of Cambridge users is duplicated from central user administration records, see the University IT Facilities and Services Privacy Notice. Names, system identifiers and affiliations pertaining to external users are stored indefinitely in order that historical usage of research computing systems can be properly attributed.
Other data held
Research data held in home directories or other personal or group storage areas is stored as required for the fulfillment of Research Computing Service services. This data is stored until purged by the user, or by the Research Computing Service to enforce advertised policy, or automatically as obsolete in the case of tape re-use.
In addition, applications, including but not limited to login shells, may record command history in files contained in the user’s home directory. Such files will survive until purged by the user, or by the Research Computing Service to enforce advertised policy, or automatically as obsolete in the case of tape re-use.
From time to time we may gather publication data from external journal or preprint listings in order to assess research outputs facilitated by research computing services.
For further information, please refer to the University IT Facilities and Services Privacy Notice and https://www.information-compliance.admin.cam.ac.uk/data-protection/general-data.