
GlueX Data Challenge Meeting
Monday, December 17, 2012
1:30 pm, EST
JLab: CEBAF Center, F326/327

Agenda

  1. Announcements
  2. Minutes from last time
  3. Data Challenge 1 status
    1. JLab
    2. Grid status
    3. CMU status
  4. Shutdown plan (or continuation plan?)
  5. Work list for the post-DC-1 period
    1. file archiving
    2. file distribution
    3.  ???
  6. Thoughts on DC-2
    1. What?
    2. How much?
    3. When?

Meeting Connections

To connect from the outside:

Videoconferencing

  1. ESNET:
    • Call ESNET Number 8542553 (this is the preferred connection method).
  2. EVO:
    • A conference has been booked under "GlueX" from 1:00pm until 3:30pm (EST).
    • Direct meeting link
    • To phone into an EVO meeting, from the U.S. call (626) 395-2112 and then enter the EVO meeting code, 13 9993
    • Skype Bridge to EVO

Telephone

  1. Phone: (should not be needed)
    • +1-866-740-1260 : US and Canada
    • +1-303-248-0285 : International
    • then use participant code: 3421244# (the # is needed when using the phone)
    • or www.readytalk.com
      • then type access code 3421244 into "join a meeting" (you need java plugin)

Minutes

Present:

  • CMU: Paul Mattione
  • JLab: Mark Ito (chair), David Lawrence, Yi Qiang, Dmitry Romanov, Elton Smith, Simon Taylor, Beni Zihlmann
  • UConn: Richard Jones

Data Challenge 1 status

Production started at the three sites Wednesday, December 5, as planned.

We reviewed progress at the various sites:

  • JLab: 678 million events
  • Grid: 3.4 billion events
  • CMU: 270 million events

See the Data Challenge 1 page for a few more details.

We ran down some of the problems encountered:

  • A lot of the time spent getting the grid effort started went to correcting problems. Some jobs would crash, resubmit themselves, and crash again; eventually a majority of the jobs were caught in this loop and had to be stopped by hand. This was solved by lowering the number of resubmissions allowed.
  • There were occasional segmentation faults in hdgeant. Richard is investigating the cause.
  • mcsmear would sometimes hang. David and Richard chased this down to the processing thread taking more than 30 seconds on an event and then killing and re-launching itself without releasing the mutex lock on the output file (see the sketch after this list).
    • Re-running the job fixed this problem because mcsmear was seeded differently each time.
    • The lock-release problem will be fixed.
    • We have to find out why it can take more than 30 seconds to smear an event.
    • The default behavior should be changed to a hard crash. Re-launching threads could still be retained as an option.
  • At JLab some jobs would not produce output files, but would only end after exceeding the job CPU limit.
  • Also at JLab, some of the REST format files did not have the full 50,000 events.
  • There may be other failure modes that we have not cataloged. We will at least try to figure out what happened with all failures.
  • At the start of the grid effort the submission node crashed. It was replaced with a machine with more memory, which solved the problem. We peaked out at 7,000 grid jobs running simultaneously, about 10% of the total grid capacity.
  • Another host in the grid system, the user scheduler, which maintains a daemon for each job, also needed more memory to function under this load.
  • The storage resource manager (SRM), which in this case handled the transfer of the output files back to UConn, was very reliable. The gigabit pipe back to UConn was essentially filled during this effort.
  • Richard thought that next time we should do 100 million events and then go back and debug the code. Mark reminded us that the thinking was that the failure rate was low enough to do useful work and that it was more important to get the data challenge going and learn our lessons, since we will have other challenges in the future. [Note added in press: coincidentally, 100 million was the size of our standard mini-challenge. Folks will recall that those challenges started out with unacceptable failure rates and iterated to iron out the kinks.]
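
To make the mcsmear locking problem concrete, here is a minimal sketch of the direction discussed above: guard the output-file mutex with an RAII lock so it can never be left held, and turn a stuck event into a hard failure instead of silently re-launching the thread. This is not the actual mcsmear code; the function names, the 30-second budget, and the overall structure are illustrative only.

    // Illustrative sketch only, not the actual mcsmear code. The RAII
    // lock_guard guarantees the output-file mutex is released on every exit
    // path, and a slow event causes a hard crash instead of a re-launched
    // thread that still owns the lock.
    #include <chrono>
    #include <cstdlib>
    #include <future>
    #include <iostream>
    #include <mutex>

    static std::mutex output_mutex;  // protects the shared output file

    // Stand-ins for the real smearing and output-writing code.
    static bool smear_event(int /*event*/) { return true; }
    static void write_event(int event) { std::cout << "wrote event " << event << "\n"; }

    static void process_event(int event) {
        // Give the smearing a 30-second budget by running it asynchronously.
        std::future<bool> result = std::async(std::launch::async, smear_event, event);
        if (result.wait_for(std::chrono::seconds(30)) != std::future_status::ready) {
            std::cerr << "event " << event << " took more than 30 s; aborting\n";
            std::abort();  // hard crash by default, as discussed in the meeting
        }
        if (!result.get()) return;                       // smearing failed; skip the write
        std::lock_guard<std::mutex> lock(output_mutex);  // released automatically
        write_event(event);
    }

    int main() {
        for (int event = 0; event < 3; ++event) process_event(event);
        return 0;
    }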

Curtis's Thoughts

Curtis sent around an email with his assessment of our status and where he thinks we should go from here. Most notably, he suggests that we write a report on DC-1.

Shutdown/Continuation Plan

There was consensus that, given that we have already exceeded our original goals by more than a factor of two, we should stop submitting jobs and assess where we are. The expectation is that the currently submitted jobs will run out in a day or two.

Work list for the post-DC-1 period

  • We decided that we would archive all files (REST files, ROOT files, and log files) to the JLab tape library. Details have to be worked out, but we should do this right away.
  • To distribute the data, we will move all of the REST data to UConn and make it available via the SRM. Note that most of the data is at UConn already anyway.
  • We will also try to have all of the REST data on disk at JLab.
  • We should look into SURA grid and see if we have any claim on its resources.
  • Paul suggested doing skims of selected topologies for use by individuals doing specific analyses. Those interested in particular types of events should think about making proposals.
  • Richard suggested we develop a JANA plug-in to read data via the SRM directly. Only the URL would need to be known and the data could be streamed in (a sketch follows this list).
  • To enable general access to the data, we decided that we should all get grid certificates, i.e., obtain credentials for the entire collaboration. Richard will send instructions on how to get started with this.
  • Problems to address:
    • seg faults in hdgeant
    • hangs in mcsmear
    • random number seed control
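
As a rough illustration of the SRM-streaming idea in the list above, the sketch below pipes a remote file through the standard output of a transfer command instead of staging it to local disk. The class name, the "transfer-tool" command, and the read interface are placeholders; this is neither the JANA event-source API nor any particular SRM client's syntax. A real plug-in would hide something like this behind JANA's event-source mechanism so that only the SRM URL appears on the command line.

    // Illustrative sketch only: stream a remote file through a transfer
    // command's stdout (via POSIX popen) so only the URL needs to be known.
    // "transfer-tool" is a placeholder for whatever SRM client would be used.
    #include <cstddef>
    #include <cstdio>
    #include <stdexcept>
    #include <string>

    class RemoteStream {
    public:
        explicit RemoteStream(const std::string& url) {
            const std::string cmd = "transfer-tool --to-stdout '" + url + "'";
            pipe_ = popen(cmd.c_str(), "r");
            if (!pipe_) throw std::runtime_error("could not start transfer for " + url);
        }
        ~RemoteStream() { if (pipe_) pclose(pipe_); }

        // Hand the next chunk of bytes to the event reader; returns bytes read.
        std::size_t read(char* buffer, std::size_t max_bytes) {
            return std::fread(buffer, 1, max_bytes, pipe_);
        }

    private:
        std::FILE* pipe_ = nullptr;
    };

The design point is simply that the data never lands on local disk: the event reader consumes bytes as they arrive over the network link.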

Thoughts on DC-2

We need to start thinking about the next data challenge, in particular, goals and schedule.