CEP and LTA computing facilities


The data coming from the Correlator are written to a cluster of computer nodes called CEP4. CEP4 is designed for the exclusive use of the Radio Observatory to process the data through the standard data reduction (flagging and averaging of the visibilities) pipelines, while another cluster, CEP3, is available for use by the commissioners and other users to manually analyse their data. An extensive description of CEP4 and CEP3 is given in the following sections. 

 


CEP4

The LOFAR phase 4 cluster (CEP4) is used to store the raw observation data and process them through the standard data pipelines. Processed data products are made available to the user via the Long-Term Archive, but may exceptionally also be copied to the CEP3 cluster upon request for further analysis by the users of the original proposal. Because the standard data pipelines are resource-intensive and these compute resources must be allocated and scheduled by Radio Observatory staff, access to the resources on CEP4 is strictly limited to the Radio Observatory. A short description of the computing characteristics of the cluster is given below.

The LOFAR Phase 4 cluster consists of:

  • 50 compute nodes (called cpu01..cpu50)
  • 4 GPU nodes (gpu01..gpu04)
  • 18 storage nodes (data01..data18)
  • 2 meta-data nodes (meta01..meta02)
  • 2 head nodes (head01..head02)
  • 1 management node (mgmt01)

Each node is reachable as XXXX.cep4.control.lofar. Users are only allowed on head01 and head02.
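
As an illustration of the naming scheme above, the following minimal Python sketch expands the node groups into fully qualified host names. The group prefixes, counts and domain come from the list above; the function and variable names are hypothetical and only serve the example.

```python
from typing import Dict, List

# Node groups as listed above: host name prefix -> number of nodes.
CEP4_NODE_GROUPS: Dict[str, int] = {
    "cpu": 50,
    "gpu": 4,
    "data": 18,
    "meta": 2,
    "head": 2,
    "mgmt": 1,
}
DOMAIN = "cep4.control.lofar"


def cep4_hostnames(prefix: str) -> List[str]:
    """Return the fully qualified names of one node group, e.g. cpu01..cpu50."""
    count = CEP4_NODE_GROUPS[prefix]
    return [f"{prefix}{i:02d}.{DOMAIN}" for i in range(1, count + 1)]


def all_cep4_hostnames() -> List[str]:
    """Return the fully qualified names of every node in the cluster."""
    names: List[str] = []
    for prefix in CEP4_NODE_GROUPS:
        names.extend(cep4_hostnames(prefix))
    return names


if __name__ == "__main__":
    # Only the head nodes are open to users.
    print(cep4_hostnames("head"))
```

Note that only head01 and head02 accept user logins; the other names are listed for reference.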
Each compute node consists of:

  • CPU: Intel Xeon E5-2680v3 2.5 GHz (12 cores, HyperThreading disabled)
  • Memory: 256GB @ 2133 MHz
  • Disk: 2x 300GB 10Krpm SAS RAID
  • Network: 2x 1GbE, 2x 10GbE, 1x FDR InfiniBand

Each GPU node consists of:

  • CPU: Intel Xeon E5-2630v3 2.4 GHz (8 cores, HyperThreading disabled)
  • Memory: 320GB @ 1866 MHz
  • Disk: 2x 300GB 10Krpm SAS RAID + 2x 6TB 7.2Krpm SAS RAID
  • Network: 2x 1GbE, 2x 10GbE, 1x FDR InfiniBand

Each head node consists of:

  • CPU: Intel Xeon E5-2603v3 1.6 GHz (6 cores, HyperThreading disabled)
  • Memory: 128GB @ 1600 MHz
  • Disk: 2x 300GB 10Krpm SAS RAID + 2x 6TB 7.2Krpm SAS RAID
  • Network: 2x 1GbE, 2x 10GbE, 1x FDR InfiniBand

The other nodes (storage, meta-data, and management nodes) are not accessible.


Storage: The storage and meta-data nodes provide a ~2PB LustreFS global filesystem, mounted under /data on all nodes through the InfiniBand network, so that all nodes see the same data.

Processing: CEP4 uses a SLURM batch scheduling system to schedule and run all observation and processing jobs on the cluster.

 

CEP3

The CEP3 cluster allows for running science processing close to the CEP4 facility. The CEP3 cluster consists of 24 Dell PowerEdge R720 servers:

  • 20 compute nodes with 22TB of capacity each (called lofxxx)
  • 2 "lhd" head nodes (called lhdxxx; lhd002 is the main head node for users)
  • 2 test nodes (lof021 and lof022)

Each CEP3 node consists of:

  • 20 cores (2 ten-core Intel Xeon E5-2660v2 processors)
  • 128 GB memory
  • 8 x 4TB disks in RAID6 setup, 22TB net disk space, XFS filesystem
  • 2 x 10Gbps Ethernet interface
  • Dell PERC H710P RAID controller
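
The 22TB net figure is consistent with a RAID6 array of eight 4TB disks. The sketch below works through the arithmetic under two assumptions that are not stated in the source: that the 4TB disk size is quoted in decimal units (10^12 bytes) and that the net figure is quoted in binary units (2^40 bytes). Filesystem overhead is ignored, and the function name is hypothetical.

```python
TB = 10**12   # decimal terabyte, as typically quoted for disks (assumption)
TIB = 2**40   # binary tebibyte, as typically reported by the OS (assumption)


def raid6_usable_bytes(n_disks: int, disk_bytes: int) -> int:
    """Usable capacity of a RAID6 array: two disks' worth of space holds parity."""
    if n_disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (n_disks - 2) * disk_bytes


usable = raid6_usable_bytes(8, 4 * TB)
usable_tb = usable / TB     # 24.0 decimal TB
usable_tib = usable / TIB   # roughly 21.8 TiB, i.e. about 22 when rounded

if __name__ == "__main__":
    print(f"{usable_tb:.1f} TB = {usable_tib:.1f} TiB")
```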

None of the servers are equipped with GPU boards at this point but in the future the servers can be fitted with up to two GPU cards (e.g. NVIDIA K20X).

Two head nodes (lhd001 & lhd002) are available for logging in and for (limited) interactive development and processing. The other 22 servers are named lof<xxx>, with <xxx> ranging from 001 to 022, and are intended for user access and processing. Access to the twenty worker nodes is managed through a job management system: users are required to submit requests for processing jobs/sessions on the worker nodes. In general, data are distributed across the local disks of the worker nodes and processing jobs are distributed accordingly. One server (lof014) is reserved for observatory use, five servers (lof005, lof007, lof008, lof019 and lof022) are reserved for MSSS re-processing, and one server (lof009) is reserved for commissioning.
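
The reservations described above can be summarised in a short Python sketch. The node names and reservation purposes are taken from the text; the data structure and function name are hypothetical, and treating every unreserved node as available for general use is an assumption, since actual allocation is decided by the job management system and observatory policy.

```python
from typing import Dict, List, Set

# Worker nodes lof001..lof022.
ALL_WORKERS: List[str] = [f"lof{i:03d}" for i in range(1, 23)]

# Reservations as listed in the text above.
RESERVED: Dict[str, Set[str]] = {
    "observatory use": {"lof014"},
    "MSSS re-processing": {"lof005", "lof007", "lof008", "lof019", "lof022"},
    "commissioning": {"lof009"},
}


def unreserved_nodes() -> List[str]:
    """Return the worker nodes that carry no reservation, in name order."""
    reserved = set().union(*RESERVED.values())
    return [name for name in ALL_WORKERS if name not in reserved]


if __name__ == "__main__":
    print(unreserved_nodes())
```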

 

Observing, CEP4 processing time and the use of CEP3 are allocated by the LOFAR Programme Committee and the ILT director during the regular proposal evaluation stages, or under Director's Discretionary Time. Therefore, users who need to use CEP3 to process their data should request CEP3 resources in the proposals.


Access to and use of CEP3 are under the sole control of the Radio Observatory's Science Operations and Support group (SOS). Access for users will be granted only at the discretion of the SOS group, and users must conform at all times to the access, resource allocation and data deletion policies issued by the SOS group.

To obtain access to CEP3, a formal request must be submitted to the RO helpdesk, which is hosted at https://support.astron.nl/rohelpdesk, or be explicitly included in a proposal for LOFAR observing time. When submitting a request, users should clearly include the following information:

- A brief explanation of why access to CEP3 is required (e.g., you do not have access to suitable computing resources elsewhere)
- Project to be worked on (i.e. commissioning, cycle or archived data (post) processing)
- Description of the kind of processing for which CEP3 access is requested
- List of collaborators (if any) who should also have access
- General description of the data to be processed (e.g. data size)
- Estimated processing time required

Users granted access to CEP3 will be able to use the cluster for a limited period of time (8 weeks by default). At the beginning of a Cycle, users can derive this timeline by checking the observing schedule, which is available here. Access timelines for observing programs whose observations are spread over time will be discussed between the PI and Science Operations and Support. After the granted period on CEP3 has expired, all of the user's data products generated on the cluster will be automatically and promptly removed, so that new users have enough disk space to perform their data reduction.
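
To make the default access window concrete, the following Python sketch computes the date on which an 8-week period ends. The start date is purely hypothetical, and the function name is introduced only for this example; the actual start and end of an access period are set by Science Operations and Support.

```python
from datetime import date, timedelta

DEFAULT_ACCESS_WEEKS = 8  # default access period stated above


def access_end(start: date, weeks: int = DEFAULT_ACCESS_WEEKS) -> date:
    """Return the date on which an access period of `weeks` weeks ends."""
    return start + timedelta(weeks=weeks)


if __name__ == "__main__":
    hypothetical_start = date(2020, 3, 2)  # example value, not a real allocation
    print(access_end(hypothetical_start))  # 2020-04-27
```

Data products that are still needed should be copied off the cluster before this date, since they are removed once the period expires.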

 

Extensions to the default 8-week period are granted only in *exceptional* circumstances, and only if properly justified in a request submitted through the RO helpdesk, which is hosted at https://support.astron.nl/rohelpdesk, before the end of the granted period. In addition, the Radio Observatory has started monitoring node usage during the allocated default time. The evaluation of future requests will also be based on these statistics.

 

Additional ILT-related computing resources

LOFAR users who are looking for computational resources for the purpose of analyzing LOFAR data can consider applying for this at one of the compute centers that participate in the hosting of the LOFAR Long Term Archive. Application for Dutch grid resources, including the grid clusters at SURFsara and at the RUG, can be made via the following site which also provides references for other compute clusters hosted by SURFsara:

https://e-infra.surfsara.nl/

For the compute clusters hosted by the ForschungsZentrum Jülich, applications can be made via:

http://www.fz-juelich.de/ias/jsc/EN/Expertise/Services/JSConline/Computi...

Each compute center has its own application and allocation process, as well as rules that applicants must comply with. These are separate from the application and assessment process for ILT projects. To prevent conflicts between allocations for observing proposals and for compute resources at one of the partner institutes, users should provide details of their intentions in their proposals. Advance notice to the Observatory of an application for computational resources at one of the above-mentioned institutes will help the assessment process for both. Applicants may consider including ASTRON staff in an application for computational resources, as affiliation with a Dutch research institute is required, e.g. when applying for Dutch GRID resources.

Design: Kuenst.    Development: Dripl.    © 2020 ASTRON