GlueX Offline Software Meeting
Wednesday, April 16, 2014
1:30 pm EDT
JLab: CEBAF Center F326/327

Agenda

  1. Announcements
  2. Review of minutes from the last meeting: all
  3. Kinematic Fitter Update (Paul)
  4. Data Challenge 2
    1. Data Challenge Meeting Report, April 11 (Curtis)
    2. Site Status Updates as appropriate
    3. Skimming Data Challenge Data (Paul)
    4. Data Distribution
      • the OSG-generated results from DC2 are being stored at /Gluex/test on the UConn SRM
      • the location on the Northwestern University SRM is /mnt/xrootd/gluex/dc2
      • for instructions on how to access files over SRM, see the appropriate section of the howto.
    5. More Data Challenge Meetings?
  5. AOT

Communication Information

  1. To join via a Polycom room system, go to the IP address 199.48.152.152 (bjn.vc) and enter the meeting ID 292649009.
  2. To join via a Web Browser, go to the page https://bluejeans.com/292649009.
  3. To join via phone, use one of the following numbers and the Conference ID: 292649009
    • US or Canada: +1 408 740 7256 or
    • US or Canada: +1 888 240 2560


Slides

Talks can be deposited in the directory /group/halld/www/halldweb/html/talks/2014-2Q on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2014-2Q/ .

Minutes

Present:

  • CMU: Paul Mattione
  • FSU: Volker Crede, Aristeidis Tsaris
  • IU: Kei Moriya
  • JLab: Mark Ito (chair), David Lawrence, Curtis Meyer, Dmitry Romanov, Simon Taylor
  • MIT: Justin Stevens
  • NU: Sean Dobbs
  • UConn: Richard Jones [and others?]

Review of Minutes from the Last Meeting

We went over the minutes of the April 2 meeting.

Kei commented on continued work on his study of REST file sizes and reproducibility. He has sent problem files to Paul and Simon and asked for feedback. He also indicated that this project may go on the back burner for now, given the size of the differences seen thus far.

Data Challenge 2

Data Challenge Meeting Report, April 11

Curtis recapped the meeting.

Things were running well, with a very low failure rate. The OSG was just starting up; there had been problems with a site in Brazil that was accepting jobs which then failed immediately. Most of the sites are finished or winding down (with the exception of the OSG).

Event Tally Board

We took a look at the board (https://docs.google.com/spreadsheets/d/1qvF9B-76gr8NdsTKsO17jqL0qc5OXqK46JluvXnJ98k/edit?usp=sharing). We are now at about 5 billion events. Note that this is already as many events as we had for data challenge 1, and each of these events is several times more expensive in CPU time than those were.

Site Status Updates

MIT. Justin is still running on about 300 cores, which include the FutureGrid cores; those will have to be returned at some point soon. He has produced about 40 million events over the past few days.

CMU. Paul has summarized results on his wiki page, in section 5. He had only 3 failures in 7,000 jobs. He catalogs the reasons for those failures.

JLab. Mark showed the updated plot of running jobs as a function of time for the entire data challenge period. He also showed a plot from Sandy Philpott showing all of the jobs on the farm for the past three months. The steps in job numbers as nodes were switched from LQCD to the farm are clearly visible. Mark also reviewed his message from Monday announcing the ramp-down of the data challenge at JLab and the return of nodes from the farm to LQCD.

OSG. We looked at recent job history on the Grid. Richard reported that he got a big batch of jobs through, but some of his grid proxies were getting stale, so he paused recently and is now starting back up. The Purdue site has withdrawn its nodes indefinitely due to events in the aftermath of the Heartbleed bug. The GlueX sites (UConn and NU) have been operating at 98 to 99% efficiency ("productive" CPU time as a fraction of wall time), in contrast to the 60% seen in DC-1. On some sites glide-ins were advertised to us but the jobs were rejected, because the offered proxy had been renamed relative to the one used in previous running; this has been cleared up administratively. Support from the OSG has been very good. Once jobs start they generally run to completion, and in general running has been much smoother than last time: Richard reported 2 failures out of half a million jobs. He plans to continue running until he can get two or three days of steady-state, problem-free running, and will try to balance our run mix as well.

Kinematic Fitter Update

Paul led us through his email announcing changes to his analysis library and the kinematic fitter in particular. His email has a complete description of the changes and interested parties should look there for details.

Kei asked about the recommended procedure for characterizing thrown particle topology. Paul told us that all of the thrown information is in the tree, but some navigation by hand is necessary to totally understand everything in the decay chain.

Justin asked about reasonable cuts for matching between charged tracks and clusters. Paul uses a five-sigma cut by default. This has not been studied extensively; studies will be needed to optimize the cut, and the optimum may depend on what one is trying to do.

Justin also asked for clarification of the matching between the requested DReaction and the thrown information. For this there is no dependence on reconstructed information.

Data Distribution

Richard gave us guidance on accessing data challenge data.

  • the OSG-generated results from DC2 are being stored at /Gluex/test on the UConn SRM
  • the location on the Northwestern University SRM is /mnt/xrootd/gluex/dc2
  • for instructions on how to access files over SRM, see the appropriate section of the howto ("Using the Grid#Accessing stored data over SRM").

He emphasized the best practice of never attempting an srmls on a data directory; instead, one should fetch the .ls-l listing file kept in each directory. He also told us that XRootD is supported as a data transport protocol on the SRM servers.
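
To make this access pattern concrete, here is a minimal sketch in Python, assuming the gfal2 command-line tools (or an equivalent SRM client such as srmcp) are installed. The SRM endpoint hostname and port below are placeholders, not the actual UConn server address.

  import subprocess

  # Placeholder SRM endpoint; substitute the actual hostname/port of the UConn
  # (or Northwestern) storage element before use.
  SRM_BASE = "srm://srm.example.edu:8443/Gluex/test"

  def srm_fetch(remote_path, local_path):
      """Copy a single file from the SRM to local disk with gfal-copy."""
      subprocess.check_call(
          ["gfal-copy", SRM_BASE + "/" + remote_path, "file://" + local_path])

  # Following the advice above: do not run srmls on a data directory. Instead
  # fetch the .ls-l listing file kept in that directory and read the file
  # names from it.
  srm_fetch(".ls-l", "/tmp/dc2-listing.txt")
  with open("/tmp/dc2-listing.txt") as listing:
      for line in listing:
          print(line.rstrip())

Since XRootD transport is also supported, the same file could alternatively be fetched with xrdcp from a root:// URL for the same storage element; the exact door address would again need to be confirmed with the site.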

He also proposed that if Globus Online is the only parallel transport method supported by JLab, then we should deploy it at non-JLab sites, in the spirit of good collaboration. He thought that Globus Online Personal would not be compatible with his node at UConn, so a license would have to be bought to make it a full-fledged endpoint. The cost appears not to be prohibitive.

We discussed continuing to explore options for transport, principally Globus Online, the OSG SRM, and raw GridFTP.

Skimming Data Challenge Data

Paul showed us a list of proposed skims that we could perform on the data challenge events. They basically cover the waterfront. They are grouped into three broad categories: non-strange meson channels, strange meson channels, and hyperon channels. With the analysis library in place, implementing any of these skims is not a big effort. As he states on his page, the cuts may have to be studied before going into production.

Given that we have his technology in hand, the discussion revolved around what we should do with it. The input is obviously the reconstructed REST data. The two possible approaches are:

  1. Do a traditional skim, writing out only selected events to a smaller output file.
  2. Implement the EventStore. Then multiple skims can be supported from a single set of files.

We agreed that, in the long run, centralizing this function would eliminate duplication of manpower and waste of computing resources. We also noted that some of this processing could conceivably be done at non-JLab sites, since it does not involve shipping around "raw" data.

We will explore both approaches. Paul will do a few pilot skims as a demonstration. Sean will look at the EventStore.
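
For concreteness, here is a minimal sketch of the first (traditional-skim) approach. The dictionary event format and the selection shown (a two-kaon requirement) are hypothetical stand-ins for illustration only; the real skims would run over REST files using Paul's analysis library.

  # Sketch of a traditional skim: read events, keep only those passing a
  # selection, and write the survivors out as a smaller sample.
  # The event structure and the two-kaon cut below are hypothetical stand-ins.

  def passes_skim(event):
      """Stand-in selection: require at least two kaon candidates."""
      kaons = [t for t in event.get("tracks", []) if t.get("pid") in ("K+", "K-")]
      return len(kaons) >= 2

  def skim(events):
      """Yield only the events that pass the selection."""
      for event in events:
          if passes_skim(event):
              yield event

  if __name__ == "__main__":
      # Toy input standing in for a file of reconstructed events.
      sample = [
          {"tracks": [{"pid": "K+"}, {"pid": "K-"}, {"pid": "p"}]},
          {"tracks": [{"pid": "pi+"}, {"pid": "pi-"}, {"pid": "p"}]},
      ]
      kept = list(skim(sample))
      print("kept %d of %d events" % (len(kept), len(sample)))

The EventStore approach (option 2) would instead record, for each skim, the locations of the qualifying events in the existing files, so that many skims can be served from a single copy of the data.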

Data Challenge Meetings?

We decided to suspend the special data challenge meetings and put discussion back into the regular bi-weekly offline meeting.

Other Offline Items

David directed our attention to other items that need work now that the production part of the data challenge is over.

  1. Tagger reconstruction is not in HDGeant. Although tagger hits are there, based on the thrown photon energy (not a detailed particle swim), the step to turn them back into a photon energy has not been implemented.
  2. The online group is discussing moving online code to a separate Subversion repository. The pluses and minuses of such a move have to be discussed in both the online and offline working groups.
  3. The GlueX wiki software is getting rather old: we are running version 1.17, from 2011, while the latest is 1.22. David will look into a version refresh. There is a related issue: changing the authentication to the standard JLab LDAP scheme, which seems like a good idea. That move may or may not have an influence on how we proceed.