GlueX Data Challenge Meeting, February 14, 2014
Friday, February 14, 2014
11:00 am, EST
JLab: CEBAF Center, L207
ReadyTalk: (866)740-1260, access code: 1833622
ReadyTalk desktop: http://esnet.readytalk.com/ , access code: 1833622
Agenda
- Status of Preparations
- Update on φ=0 geometry issues in CDC: solved?
- Random number seeds procedure (Mark/anyone)
- Running jobs at CMU (Paul)
- Running jobs at NWU (Sean)
- Running jobs at MIT (Justin)
- 100 jobs with 10K events each, no EM background: only 2 failed (with DTrackFitterKalmanSIMD)
- 100-job tests at various sites: JLab, Grid, CMU, NWU, ...
- Electromagnetic backgrounds update (Paul/Kei): updated studies
- Check on event genealogy (Kei)
- Preparations of standard distribution/scripts (Mark/Richard/Sean)
- Report on data management practices (Sean)
- Proposed Schedule
- Launch of data challenge: Thursday, February 27, 2014 (estimated).
- Test jobs running successfully by Tuesday, February 25.
- Distribution ready by Monday, February 24.
Present:
- CMU: Paul Mattione, Curtis Meyer
- FSU: Volker Crede, Aristeidis Tsaris
- IU: Kei Moriya
- JLab: Mark Ito (chair), Chris Larrieu, Simon Taylor
- NWU: Sean Dobbs
Minutes

Update on φ=0 geometry issues in CDC
Richard checked in a change. Simon and Beni confirm that this fixes the issue.
Random number seeds procedure
Still nothing new to report.
Running jobs at CMU
Paul has continued running jobs. Last time he reported a third of his jobs crashing; since then he has increased the memory limit from 2 GB to 5 GB, and the crash rate is now down to 5%.
Simon has been running Valgrind, but he sees a lot of issues flagged in Xerces, which are likely not our problem.
Paul has also been running hd_root within the time command and is seeing memory usage vary between 3 and 5 GB.
Running jobs at NWU
Sean has been seeing start-up problems with hd_root that seem to be correlated with whether the input and/or output files are on network disk or on the local machine. In particular, he sees crashes when reading in the geometry. This has not been seen at other sites, although the exact configuration giving him problems may not have been tried much elsewhere. He is tracking down the error.
Running jobs at MIT
Justin has succeeded in running 100 jobs of 10K events each, with no EM background, at MIT on the OpenStack cluster. He saw only two failures, both in DTrackFitterKalmanSIMD.
Running jobs at JLab
Mark updated the DC2 tag to bring in Simon's latest changes to tracking, as well as Richard's fix for the φ=0 issue in the CDC. He ran 400 jobs of 50,000 events each with the same configuration as last week, except that the requested memory was doubled, to 3.0 GB from 1.5 GB. He saw a much improved success rate: 84%, versus the 26% reported last week. Much of the improvement came from a reduced rate of breaching the memory limit, but not all: two percent of the jobs failed because they exceeded 15 hours of wall time, while at least one successful job took only 9 hours to finish.
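As an illustration of the resource requests involved, a PBS-style batch script with the doubled memory request and the 15-hour wall-time limit might look like the sketch below. The directives and file names are assumptions for illustration, not the actual JLab farm submission:

```shell
#!/bin/sh
# Illustrative PBS-style resource requests (not the actual JLab submission):
#PBS -l mem=3gb            # requested memory, doubled from 1.5 GB
#PBS -l walltime=15:00:00  # jobs exceeding 15 h of wall time are killed

# Placeholder command line; the real jobs run hd_root on smeared hdgeant output.
hd_root hdgeant_smeared.hddm
```

The trade-off the numbers above illustrate: raising the memory request cuts out-of-memory failures, while the wall-time limit has to leave headroom over the longest successful jobs or it becomes the dominant failure mode.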
Mark has also started to run very short jobs, to help Simon debug the tracking crashes. He ran 10,000 jobs of 1000 events each last weekend, and 4,000 jobs of 500 events each with the new DC2 version. For these, he is keeping the smeared data that is input to hd_root. Simon was able to use last weekend's jobs to find the errors discovered during the week.
Electromagnetic Backgrounds update
Kei gave us an update on his performance comparisons with and without electromagnetic background. See his slides for details. He compares memory usage and execution time for various combinations of background time gate and beam intensity. He also showed the effect on reconstructed quantities in slides titled as follows:
- Increase in Showers
- FCAL Shower time vs. energy
- Projections of Time and Energy of FCAL Showers
- FCAL Shower Time vs. Distance from Target
- FCAL Energy Showers After Timing Cut
- Generated E, p
- Generated p vs. θ
Check on event genealogy
During his studies of EM background, Kei noticed that the summed energy and momentum of all primary particles (generated E, p, above) is now much more sensible than before. There are still some low-level tails that may have to do with particles re-entering the detector from the calorimeters and getting mis-classified.
We started to discuss this issue, but decided to postpone a full discussion until the collaboration meeting.
We will not meet next week due to the collaboration meeting, but will start up again the week after. The main task for us is to continue tracking down the cause of job crashes.