Compressing LOFAR measurement sets using Dysco

Summary

As of 10 September 2018, all LOFAR HBA data products ingested into the Long Term Archive (LTA) will be compressed using Dysco. This decision was made after evaluating the effect of visibility compression on LOFAR measurement sets (see below for more information).

  • Our tests indicate that compressing the LBA and the HBA measurement sets with Dysco does not produce any visible differences in the calibrator solutions or the recovered source properties.

To process the Dysco compressed data, you will need to run the LOFAR software version 3.1 or later (built with the dysco library). The Dysco compression specifications (using 10 bits per float to compress visibility data) and the tests carried out as part of the commissioning effort are valid for any HBA imaging observation with a frequency resolution of at least four channels per subband and a time resolution of 1 second. Note that using 10 bits is a conservative choice and the compression noise should be negligible.

 

Need for dysco visibility compression

Modern radio interferometers like LOFAR contain a large number of baselines and record visibility data at high time and frequency resolution, resulting in significant data volumes. A typical 8-hour observing run for the LOFAR Two-metre Sky Survey (LoTSS) produces about 30 TB of preprocessed data. It is important to manage the data growth in the LTA, especially in view of the increasing observing efficiencies. One way to achieve this is to compress the recorded visibility data. Recently, Offringa (2016) proposed a new technique called Dysco to compress interferometric visibility data. The new compression technique is fast, and the noise added by data compression is small (within a few per cent of the system noise in the image plane) and has the same characteristics as the normal system noise. For specific information on the compression technique, see Offringa (2016) and the casacore storage manager available here.
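To give a feel for why a 10-bit representation adds little noise, the following is a minimal sketch and not Dysco's actual algorithm. It rounds unit-variance Gaussian samples with a plain uniform quantiser, whereas Dysco normalises the visibilities and uses a non-uniform quantisation scheme described in Offringa (2016). The clipping level clip_sigma, the sample count and the function names are illustrative assumptions.

```python
import math
import random


def quantise_uniform(x, n_bits=10, clip_sigma=4.0):
    """Clip x to +/- clip_sigma and round it to one of 2**n_bits uniform levels."""
    n_levels = 2 ** n_bits
    step = 2.0 * clip_sigma / (n_levels - 1)
    clipped = max(-clip_sigma, min(clip_sigma, x))
    index = round((clipped + clip_sigma) / step)  # integer code that would be stored
    return index * step - clip_sigma


def added_noise_fraction(n_samples=200_000, n_bits=10, seed=1):
    """RMS quantisation error as a fraction of the (unit) input noise level."""
    rng = random.Random(seed)
    sq_err = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)  # stand-in for a noise-dominated visibility
        sq_err += (quantise_uniform(x, n_bits) - x) ** 2
    return math.sqrt(sq_err / n_samples)


if __name__ == "__main__":
    for bits in (4, 6, 8, 10):
        print(bits, "bits:", added_noise_fraction(n_bits=bits))
```

Running the script prints the added noise for several bit depths; in this toy model the error shrinks steadily as the number of bits grows, which is the qualitative behaviour behind the statement that 10 bits is a conservative choice.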

Commissioning tests

Before integrating the Dysco compression technique in the Radio Observatory production pipelines, the Radio Observatory carried out a commissioning effort to characterise how compressing visibility data using Dysco affects the calibration solutions and the images produced. 

Compressing HBA data 

To validate Dysco compression on LOFAR HBA data, we carried out a test observation using the standard LoTSS setup (2x244 subbands, 16 ch/sb, 1 s time resolution). The raw visibilities were preprocessed (RFI flagging and averaging) using three different preprocessing pipelines: (i) the standard production pipeline without any compression, (ii) a pipeline with Dysco compression enabled on the visibility data, and (iii) a pipeline with Dysco compression enabled on both the visibility data and the visibility weights. The data products produced by the three pipeline runs were processed using the direction-independent Prefactor and direction-dependent Factor pipelines.

Comparing the gain solutions and the images produced by the Prefactor and Factor runs shows that compressing the visibility data and the visibility weights has little impact on the final output data products. The key results from this exercise can be summarized as follows:

  • Compressing the measurement sets with dysco does not produce any visible differences in the calibrator gain amplitudes, clock and TEC solutions.
  • Gain solutions for a given facet derived as part of the Factor direction-dependent calibration scheme are similar for the Dysco-compressed and uncompressed datasets (see Fig. 1).
  • For one facet (containing the brightest facet calibrator), we found that the gain solutions for a few remote stations were different for the dysco compressed case (See Fig. 2). This is caused by the different clean-component models used during the facet selfcal step. However, since the image-domain comparisons are identical between different pipeline products, this is not a cause for concern. 
  • The mean ratio of source fluxes (see Fig. 3) between the uncompressed and the Dysco-compressed datasets is 1.004 ± 0.064.
  • The mean positional offset in both right ascension and declination is less than 0.08 arcsec.
  • For a typical LoTSS observation, the disk space occupied by the compressed visibility data is about a factor of 3.6 smaller than uncompressed data.
  • Since compressing and decompressing the visibility data is faster than the typical disk read/write times, Dysco compression does not increase the computational cost of the Radio Observatory production pipelines or of the processing pipelines run by users.
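The flux-ratio and positional-offset statistics quoted above can be computed from two cross-matched source catalogues along the following lines. This is a sketch only: the Source structure, the assumption that the two catalogues are already matched one-to-one and in the same order, and the choice of dividing the uncompressed flux by the compressed flux are all assumptions made for illustration, not a description of the scripts used in the commissioning work.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class Source:
    ra_deg: float    # right ascension in degrees
    dec_deg: float   # declination in degrees
    flux_jy: float   # integrated flux density in Jy


def compare_catalogues(uncompressed, compressed):
    """Return (mean flux ratio, ratio scatter, mean RA offset, mean Dec offset).

    Offsets are compressed minus uncompressed, in arcsec.
    """
    ratios, d_ra, d_dec = [], [], []
    for ref, other in zip(uncompressed, compressed):
        ratios.append(ref.flux_jy / other.flux_jy)
        # Scale the RA difference by cos(Dec) to obtain an angle on the sky.
        cos_dec = math.cos(math.radians(ref.dec_deg))
        d_ra.append((other.ra_deg - ref.ra_deg) * cos_dec * 3600.0)
        d_dec.append((other.dec_deg - ref.dec_deg) * 3600.0)
    return (statistics.mean(ratios), statistics.stdev(ratios),
            statistics.mean(d_ra), statistics.mean(d_dec))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    ref = [Source(123.40, 48.20, 0.150), Source(123.90, 48.50, 0.032)]
    dys = [Source(123.40, 48.20, 0.149), Source(123.90, 48.50, 0.032)]
    print(compare_catalogues(ref, dys))
```

A mean ratio close to unity with a scatter comparable to the flux uncertainties, together with mean offsets well below the pixel size, is the kind of outcome that indicates the compression has not biased the recovered source properties.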

 

Fig. 1. Plot showing that Dysco compression has little influence on the amplitude solutions derived for a given facet as part of the Factor direction-dependent calibration pipeline. The three colours indicate the three different datasets used to derive the solutions: red indicates solutions derived from uncompressed data, black indicates solutions for a dataset where only the visibilities were compressed, and green points correspond to solutions from a dataset where both the visibility and the visibility weights were compressed.

 

Fig. 2. Plot showing gain solutions for the facet containing the brightest facet calibrator. The gain solutions for the Dysco-compressed data are different for a few remote stations due to the difference in the clean-component model used in the facet selfcal step.

 

Fig. 3. Flux ratio between catalogues of sources produced using the uncompressed and the Dysco-compressed datasets. The mean flux ratio is 1.004 ± 0.064.
 

Compressing LBA data 

We used an 8-hour scan on 3C 196 to validate applying Dysco compression on LBA data. The observed data were preprocessed by the Radio Observatory with two different pipelines: (i) with Dysco visibility compression enabled, and (ii) without Dysco compression. Further processing was carried out by Francesco de Gasperin using the standard LBA calibrator pipeline. Comparing the intermediate data products produced by the pipeline, we find that Dysco compression has no significant impact on the data products produced by the calibration pipeline. The key results from this exercise are listed below:

  • Based on visual inspection, the calibrator solutions are identical.
  • The mean ratio of source fluxes between the uncompressed and the Dysco-compressed datasets is 1.007.
  • The largest difference in the pixel values is at the 0.01 Jy/beam level, close to the bright central source (3C 196).

 

How do I know if my data have been compressed?

Since 10 September 2018, the Radio Observatory has been recording all HBA imaging observations in Dysco-compressed measurement sets. A new column has been introduced in the LTA to identify whether a given data product has been compressed with Dysco. When you browse through your project on the LTA, on the page displaying the correlated data products, the new column Storage Writer identifies whether your data have been compressed with Dysco. For example, Fig. 4 shows the list of correlated data products for an averaging pipeline. The column Storage Writer specifies that the preprocessed data products have all been stored using the DyscoStorageManager, implying that the data have been compressed with Dysco.

To process these data you will need to run the LOFAR software version 3.1 or later (built with the dysco library) so that DPPP can automatically recognise the way the visibilities have been recorded. Note that recompressing visibility data that have already been compressed with Dysco will add noise to your data and hence should be avoided.
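For a measurement set that is already on disk, the storage manager can also be inspected outside the LTA web interface, for example before deciding whether to enable compression in a further processing step. The sketch below assumes that python-casacore is installed, that table.getdminfo() returns one entry per data manager with TYPE and COLUMNS fields, and that the Dysco entry has a type name containing "Dysco" (believed to be DyscoStMan; the match is case-insensitive to be safe). The helper names are hypothetical.

```python
def dysco_columns(dminfo):
    """Return the sorted names of columns held by a Dysco storage manager.

    dminfo is the dictionary returned by table.getdminfo().
    """
    columns = []
    for entry in dminfo.values():
        # Assumption: the Dysco storage manager type name contains "dysco".
        if "dysco" in str(entry.get("TYPE", "")).lower():
            columns.extend(entry.get("COLUMNS", []))
    return sorted(columns)


def dysco_columns_in_ms(ms_path):
    """Open a measurement set read-only and list its Dysco-compressed columns."""
    from casacore.tables import table  # imported lazily; needs python-casacore

    ms = table(ms_path, readonly=True, ack=False)
    try:
        return dysco_columns(ms.getdminfo())
    finally:
        ms.close()


if __name__ == "__main__":
    import sys

    found = dysco_columns_in_ms(sys.argv[1])
    if found:
        print("Dysco-compressed columns:", ", ".join(found))
    else:
        print("No Dysco-compressed columns found")
```

An empty result suggests the data are stored uncompressed; a non-empty result lists the columns that should not be compressed a second time.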

For further questions/comments, please contact the LOFAR Science Operations & Support using our JIRA helpdesk. 

 


Fig. 4. A new column called Storage Writer has been introduced in the LTA correlated data products view to indicate whether a data product has been compressed with Dysco. This figure shows a list of correlated data products for a given averaging pipeline; the Storage Writer column (containing the string DyscoStorageManager) indicates that these data products have been compressed with Dysco.