Planning for The Next GlueX Data Challenge

Introduction

This document provides planning information for the second GlueX data challenge, which is planned for early 2013.

Earlier Data Challenges

The first data challenge was carried out in 2011 [1], and studied the reaction γp → π+π−π+n using the full GlueX simulation [2] and reconstruction software [3]. The study included the full pythia background for about 3.5 hours of 10^8 running, with the appropriate level of the 3π signal injected into the data stream. One of the limitations on the size of this first challenge was the available disk storage at the remote sites (Connecticut and Indiana) where the work was carried out.

Goals of The Next Data Challenge

The primary goal of the next data challenge is to test as many aspects of the GlueX analysis chain as possible, using a data set large enough to push things close to what will actually be encountered in processing real GlueX data in 2016. As such, we anticipate that the initial data set would be at least an order of magnitude larger than that used in the earlier challenge.
• Check the large-scale batch processing of a large GlueX data set.
• Implement monitoring tools to adequately handle the large-scale production.
• Test schemes for accessing large amounts of processed data for analysis.
• Finalize the DST format for GlueX.
• Test tools (grid-ftp) for moving large amounts of data between JLab and outside sites.
• Develop data analysis tools and frameworks that utilize the DSTs.
In this regard, it is probably useful to break the work into two separate components. The first would be carried out mostly at JLab, to gain experience with and resolve issues associated with the large-scale batch processing. The second would be a remote-site challenge in which events are simulated and then reconstructed to the DST level; only the DST-level events would then be available for analysis.

Procedures

Generate pythia Events

The first step will be to generate an appropriately sized sample of pythia events. These can be generated both on the JLab farms and at offsite locations. For offsite production to function fully, we will need gridftp tools working at JLab that allow us to write files to the tape silos. The proposal is to write the GEANT files at a level before mcsmear is run on the events. Thus, the CPU-heavy work would already have been done, but the files would remain reusable if the smearing changes.
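As a concrete illustration, the following is a minimal sketch of what one offsite generation job might look like: generate the pythia background, run HDGeant, keep the pre-mcsmear file, and push it back to JLab with gridftp. The executable invocations, file names, and the silo URL are placeholders, not an actual production configuration.

#!/usr/bin/env python
"""Sketch of a single offsite generation job (all names and paths are placeholders)."""

import os
import subprocess

def run(cmd):
    """Run one production stage and abort the job if it fails."""
    print("running: " + " ".join(cmd))
    subprocess.check_call(cmd)

job_id = 1                                   # would come from the batch system
geant_file = "hdgeant_%06d.hddm" % job_id    # GEANT-level output, kept before mcsmear

# 1. Generate the pythia background sample.  bggen takes its run conditions
#    from control cards, so those details live outside this script.
run(["bggen"])

# 2. Track the events through the detector with HDGeant.  mcsmear is
#    deliberately not run here, so the file can be re-smeared later if the
#    detector resolutions change.
run(["hdgeant"])
os.rename("hdgeant.hddm", geant_file)        # default output name assumed here

# 3. Ship the pre-mcsmear file to the JLab tape silo with gridftp.  The
#    destination URL is invented; the real endpoint depends on the gridftp
#    setup that still has to be put in place at JLab.
run(["globus-url-copy",
     "file://" + os.path.abspath(geant_file),
     "gsiftp://gridftp.jlab.org/mss/halld/dc2/" + geant_file])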

The number of events follows roughly from the event rate. At 10^7 running, we estimate a hadronic trigger rate of 20 kHz from the pythia background. Thus, each hour of beam corresponds to 7.2 × 10^7 pythia events. The original challenge was equivalent to 35 of these hours, so we are looking at at least 350 hours, or 25.2 × 10^9 events.
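For reference, the arithmetic behind these numbers is simple enough to check directly; nothing beyond the 20 kHz rate quoted above goes in.

# Quick check of the sample-size arithmetic quoted above.
trigger_rate = 20.0e3                    # Hz, hadronic rate at 10^7 running
events_per_hour = trigger_rate * 3600    # 7.2e7 pythia events per beam hour

hours = 350                              # ~10x the 35 equivalent hours of the first challenge
total_events = events_per_hour * hours
print("%.3g events" % total_events)      # 2.52e10, i.e. 25.2 x 10^9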

The first question is whether the Monte Carlo is ready to produce these events. The one issue that we know of has to do with the dark hits in the BCAL. Are there other issues that would prevent us from starting this task? We also need to decide what the output file size should be. What is the current maximum file size supported at Jefferson Lab, both on disk and on tape?

Action Items

• Is HDGeant in shape to run these events?
• What is the number of pythia triggers to run?
• Is gridftp available to allow off-site Monte Carlo running?
• What output file size should be used?
• Notify JLab IT that we are starting this process and give them a feeling for the resources that we will be using: CPUs, disk storage, tape storage, network bandwidth....

Inject Physics into the Sample

In addition to the pythia background events, we need to inject physics signals into the Monte Carlo sample at the appropriate rate based on known cross sections; a sketch of how that rate translates into a number of events to generate is given after the list below. These signals should correspond to several different final states that will provide an opportunity to study a number of different physics channels. We note that these events do not need to be physically interspersed with the pythia files from above, but can live in their own event samples. A partial list of what would be good to include is:
• γp → pπ+π−π0.
• γp → pπ0π0π0.
• γp → nπ+π−π+.
• γp → nπ+π0π0.
• γp → pη′π0.
• γp → nη′π+.
• γp → p b1^± π∓.
• γp → p b1^0 π0.
• γp → n b1^+ π0.
• γp → n b1^0 π+.
• γp → Ξ−K+K+.
• γp → Ξ∗−K+K+.
Appropriate generators need to be developed by the physics working group.
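As referenced above, a rough sketch of how the injection level for a single channel could be set is given below. The photon flux, target parameters, and the 10 microbarn cross section are placeholder numbers; the real values come from the run conditions being simulated and from the physics working groups.

# Placeholder estimate of how many signal events to generate for one channel.
photon_flux = 1.0e7        # tagged photons/s at 10^7 running (placeholder)
target_length = 30.0       # cm of liquid hydrogen (nominal target, assumed)
rho_lh2 = 0.0708           # g/cm^3
n_avogadro = 6.022e23
protons_per_cm2 = rho_lh2 * target_length * n_avogadro / 1.008

luminosity = photon_flux * protons_per_cm2      # ~1.3e31 cm^-2 s^-1
seconds = 350 * 3600.0                          # the 350 beam hours discussed above

sigma = 10.0e-30           # assumed 10 microbarn cross section, in cm^2
n_signal = luminosity * sigma * seconds
print("signal events to generate: %.2e" % n_signal)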

Action Items

• Is HDGeant in shape to run these events?
• What are the reactions to be run?
• Are the event generators for these channels ready?
• What is the number of each event type to run?

Producing DSTs from the Monte Carlo

The main test here is of the batch-processing system's ability to handle the large number of files. The tools that support the JLab batch farms have issues with error recovery and require a lot of babysitting to move large amounts of data through. The key point is that the data need to come off the tape silos and be processed; the resulting DSTs both go back to the tape silo and become part of the DST pool for analysis. The tools need to automatically ensure that all the necessary jobs are processed with minimal human intervention, and that the resulting DST files go into a catalog that allows us to easily access the data. During our review, Torre Wenaus of BNL suggested that some of the LHC tools may be of interest; we need to follow up on some of these.

In addition to being able to handle the large number of jobs, there are associated issues with respect to data format that eventually need to be resolved. We will have tools that can convert the GEANT output into the online format, so we could expand the test to include this conversion. One could also imagine testing the piping of these "online" data from the Hall D counting house as well. While important, the selection and implementation of tools for batch systems should not wait for this.

We also need a reasonable DST format that will work for analysis. There is an initial format now, but the only way to really guarantee that we have what we need is to test it, so this is also something that we can change. A good way to start may simply be to exercise the system with the current JLab batch farms and identify the issues that we need to address to make things more robust. It is also important to view this processing as something that we will do more than once, so we do not need to wait to get everything perfect before starting on tests.
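As a concrete, deliberately simple-minded illustration of the bookkeeping such a tool has to provide, the sketch below records the state of every input file in a small SQLite catalog so that failed jobs can be found and resubmitted and the resulting DSTs can be located later. The schema and function names are invented for this example; a real system (or one of the LHC tools mentioned above) would be far more capable.

"""Minimal job/DST bookkeeping sketch (schema and interface are illustrative only)."""

import sqlite3

db = sqlite3.connect("dc2_catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS jobs (
                  input_file TEXT PRIMARY KEY,
                  status     TEXT,     -- 'pending', 'running', 'done', 'failed'
                  dst_file   TEXT,
                  attempts   INTEGER DEFAULT 0)""")

def register_input(path):
    """Add a simulated file that still needs to be reconstructed."""
    db.execute("INSERT OR IGNORE INTO jobs (input_file, status) VALUES (?, 'pending')", (path,))
    db.commit()

def mark_done(path, dst_path):
    """Record a successful job and the DST it produced."""
    db.execute("UPDATE jobs SET status = 'done', dst_file = ? WHERE input_file = ?",
               (dst_path, path))
    db.commit()

def files_to_resubmit(max_attempts=3):
    """Failed jobs that have not yet exhausted their retries."""
    rows = db.execute("SELECT input_file FROM jobs WHERE status = 'failed' AND attempts < ?",
                      (max_attempts,))
    return [row[0] for row in rows]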

Action Items

The following is a list of things that will eventually need to be ready for this test. We should probably select the minimal path to get the big things done and anticipate multiple passes over the data.
• Is mcsmear in shape to process the Monte Carlo events?
• Do we push the mcsmear output through reconstruction directly, or convert to online formats first?
• Is HDAna ready to process the events?
• Is the DST format defined?
• Do we have a tool in place to manage the jobs?
• Do we have tools to catalog the output files?
• Notify JLab IT that we are starting this process and give them a feeling for the resources that we will be using: CPUs, disk storage, tape storage, network bandwidth....

Physics Analysis of DSTs

Action Items

• Is the DST format defined?
• How will we distribute the DST events for analysis? Explore the CLEO-c tools for keeping them live on disk with internal pointers (see the sketch below).
• Do we have tools to produce mini-DSTs for distribution to remote sites?
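The CLEO-c idea referenced above amounts to a pointer skim: a physics selection stores references (parent DST file, event number) rather than copies of events, so many overlapping skims can stay live on disk cheaply. The file layout and names below are purely illustrative.

"""Toy pointer-skim writer/reader (file layout is illustrative only)."""

import json

def write_pointer_skim(name, selections):
    """selections: list of (dst_file, event_number) pairs passing some cut."""
    with open(name + ".skim.json", "w") as f:
        json.dump([{"file": fn, "event": ev} for fn, ev in selections], f)

def read_pointer_skim(name):
    """Return the (dst_file, event_number) pairs so an analysis job can seek to them."""
    with open(name + ".skim.json") as f:
        return [(rec["file"], rec["event"]) for rec in json.load(f)]

# Hypothetical example: a b1 pi candidate skim pointing back into two parent DSTs.
write_pointer_skim("b1pi_candidates",
                   [("dst_000123.hddm", 4711), ("dst_000124.hddm", 8)])
print(read_pointer_skim("b1pi_candidates"))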

Amplitude Analysis of mini-DSTs

Action Items

• ....

Conclusions