
GlueX Data Challenge Meeting
Monday, December 17, 2012
1:30 pm, EST
JLab: CEBAF Center, F326/327

Agenda

  1. Announcements
  2. Minutes from last time
  3. Data Challenge 1 status
    1. JLab
    2. Grid status
    3. CMU status
  4. Shutdown plan (or continuation plan?)
  5. Work list for post DC-1 period
    1. file archiving
    2. file distribution
    3.  ???
  6. Thoughts on DC-2
    1. What?
    2. How much?
    3. When?

Meeting Connections

To connect from the outside:

Videoconferencing

  1. ESNET:
    • Call ESNET Number 8542553 (this is the preferred connection method).
  2. EVO:
    • A conference has been booked under "GlueX" from 1:00pm until 3:30pm (EST).
    • Direct meeting link
    • To phone into an EVO meeting, from the U.S. call (626) 395-2112 and then enter the EVO meeting code, 13 9993
    • Skype Bridge to EVO

Telephone

  1. Phone: (should not be needed)
    • +1-866-740-1260 : US and Canada
    • +1-303-248-0285 : International
    • then use participant code: 3421244# (the # is needed when using the phone)
    • or www.readytalk.com
      • then type access code 3421244 into "join a meeting" (you need java plugin)

Minutes

Present:

  • CMU: Paul Mattione
  • JLab: Mark Ito (chair), David Lawrence, Yi Qiang, Dmitry Romanov, Elton Smith, Simon Taylor, Beni Zihlmann
  • UConn: Richard Jones

Data Challenge 1 status

Production started at the three sites Wednesday, December 5, as planned.

We updated progress at the various sites:

  • JLab: 678 million events
  • Grid: 3.4 billion events
  • CMU: 270 million events

See the Data Challenge 1 page for a few more details.

We ran down some of the problems encountered:

  • A lot of the time getting the grid effort started was spent correcting problems. Some jobs would crash, resubmit themselves, and crash again, so that eventually a majority of the jobs were caught in this infinite loop and had to be stopped by hand. This was solved by lowering the number of resubmissions allowed (see the retry-cap sketch after this list).
  • There were occasional segmentation faults in hdgeant. Richard is investigating the cause.
  • mcsmear would sometimes hang. David and Richard chased this down to the processing thread taking more than 30 seconds with an event and then killing and re-launching itself without releasing the mutex lock on the output file (see the locking sketch after this list).
    • Re-running the job fixed this problem because mcsmear was seeded differently each time.
    • The lock-release problem will be fixed.
    • We have to find out why it can take more than 30 seconds to smear an event.
    • The default behavior should be changed to a hard crash. Re-launching threads could still be retained as an option.
  • At JLab some jobs would not produce output files, but would only end after exceeding the job CPU limit.
  • Also at JLab, some of the REST format files did not have the full 50,000 events.
  • There may be other failure modes that we have not cataloged. We will at least try to figure out what happened with all failures.
  • At the start of the grid effort the submission node crashed. It was replaced with a machine with more memory, which solved the problem. We peaked out at 7,000 grid jobs running simultaneously. This was about 10% of the total grid capacity.
  • Another host in the grid system, the user scheduler, which maintains a daemon for each job, also needed more memory to function under this load.
  • The storage resource manager (SRM), which in this case handled the transfer of the output files back to UConn (with 20 TB of disk behind it), was very reliable. The gigabit pipe back to UConn was essentially filled during this effort.
  • Richard thought that next time we should do 100 million events and then go back and debug the code. Mark reminded us that the thinking was that the failure rate was low enough to do useful work and that it was more important to get the data challenge going and learn our lessons, since we will have other challenges in the future. [Note added in press: coincidentally, 100 million was the size of our standard mini-challenge. Folks will recall that those challenges started out with unacceptable failure rates and that we iterated to iron out the kinks.]
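
The fix for the crash/resubmit loop was essentially a hard cap on automatic resubmissions. Here is a minimal C++ sketch of that idea; run_job(), the job id, and the cap of three are illustrative stand-ins, not the actual grid submission machinery.

  #include <cstdlib>
  #include <iostream>

  // Stand-in for launching one grid job; returns 0 on success, nonzero on failure.
  static int run_job(int /*job_id*/) { return 1; }  // simulate a job that always crashes

  int main()
  {
      const int kMaxResubmissions = 3;  // the DC-1 fix was to lower this limit
      const int job_id = 42;            // hypothetical job identifier

      for (int attempt = 1; attempt <= 1 + kMaxResubmissions; ++attempt) {
          if (run_job(job_id) == 0) {
              std::cout << "job " << job_id << " succeeded on attempt " << attempt << "\n";
              return EXIT_SUCCESS;
          }
          std::cerr << "job " << job_id << " failed on attempt " << attempt << "\n";
      }
      // Once the cap is reached the job is left for a human to inspect instead
      // of being resubmitted yet again; this is what stopped the infinite loops.
      std::cerr << "job " << job_id << " abandoned after repeated failures\n";
      return EXIT_FAILURE;
  }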
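
To illustrate the mcsmear deadlock, here is a simplified C++ sketch of the locking pattern involved. The names (output_file_mutex, write_event_unsafe, write_event_safe) are invented for illustration and are not the actual JANA/mcsmear code.

  #include <mutex>

  std::mutex output_file_mutex;  // guards the shared output file

  // Buggy pattern: if the thread is killed mid-write (event took more than
  // 30 seconds), the mutex is never unlocked and every other thread blocks
  // forever on the output file -- the hang seen during DC-1.
  void write_event_unsafe()
  {
      output_file_mutex.lock();
      // ... smear and write the event ...
      output_file_mutex.unlock();
  }

  // Sketch of the intended fix: an RAII guard releases the lock whenever its
  // destructor runs (normal return, exception, or stack unwinding).  Killing a
  // thread outright can still bypass destructors, which is one argument for the
  // proposed change to a hard-crash default rather than a silent thread re-launch.
  void write_event_safe()
  {
      std::lock_guard<std::mutex> guard(output_file_mutex);
      // ... smear and write the event ...
  }

  int main()
  {
      write_event_safe();
      return 0;
  }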

Shutdown/Continuation Plan

There was consensus that, given that we have already exceeded our original goals by more than a factor of two, we should stop submitting more jobs and assess where we are. The expectation is that the currently submitted jobs will run out in a day or two.

Work List for Post-DC-1 Period

  • Archive all files (logs, histograms, REST) to the JLab tape library.
  • Distribution: ship all REST files to UConn, with access via SRM; keep all files spinning on disk at JLab.
  • SURA grid
  • Skims
  • SRM plug-in
  • Grid certificates; a collaboration-wide archive
  • Seg faults in hdgeant; JANA hangs, thread re-launching, and random seeds

