GlueX Offline Meeting, January 18, 2017


GlueX Offline Software Meeting
Wednesday, January 18, 2017
2:00 pm EST
JLab: CEBAF Center F326/327

Agenda

  1. Announcements
    1. Backups of the RCDB database in SQLite form (Mark)
    2. Development of a wrapper for signal MC generation (Thomas)
    3. More Lustre space (Mark)
  2. Review of minutes from the last meeting (all)
  3. Lustre system status (Kurt Strosahl, SciComp)
    • ifarm1102: RIP
    • ifarm1101, ifarm1401: CentOS 6.5
    • ifarm1402: CentOS 7
  4. Launches (Alex A.)
    1. 2016-10 offline monitoring ver02
    2. 2016-02 analysis launch ver05
  5. Sim 1.2 (Sean, Mark)
  6. HDGeant/HDGeant4 Update (Richard)
  7. Review of recent pull requests (all)
  8. Review of recent discussion on the GlueX Software Help List
  9. Action Item Review

Communication Information

Remote Connection

Slides

Talks can be deposited in the directory /group/halld/www/halldweb/html/talks/2017 on the JLab CUE. This directory is accessible from the web at https://halldweb.jlab.org/talks/2017/.

Minutes

Present:

  • FIU: Mahmoud Kamel
  • JLab: Alexander Austregesilo, Nathan Baltzell, Alex Barnes, Thomas Britton, Brad Cannon, Mark Ito (chair), Nathan Sparks, Kurt Strosahl, Simon Taylor, Beni Zihlmann
  • MIT: Cristiano Fanelli
  • NU: Sean Dobbs
  • UConn: Richard Jones + 2
  • W&M: Justin Stevens

There is a recording of this meeting (https://bluejeans.com/s/tgAip/) on the BlueJeans site.

Announcements

  1. Backups of the RCDB database in SQLite form are now being kept on the write-through cache, in /cache/halld/home/gluex/rcdb_sqlite/. See Mark's email (https://mailman.jlab.org/pipermail/halld-offline/2017-January/002573.html) for more details; a short example of inspecting one of these backups follows this list.
  2. Development of a wrapper for signal MC generation. Thomas has written scripts to wrap the basic steps of signal Monte Carlo generation. One specifies the number of events and the .input file to use for genr8, and the jobs are submitted via SWIF; a sketch of such a wrapper appears after this list. Paul thought that the average user would find this useful. Mark suggested that the code could be version-controlled in the hd_utilities repository on GitHub (https://github.com/JeffersonLab/hd_utilities).
  3. More Lustre space. Mark reported that our total Lustre quota has been increased from 200 TB to 250 TB. See his email (https://mailman.jlab.org/pipermail/halld-offline/2017-January/002594.html) for a few more details.
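
On the first item, here is a minimal sketch of how one of the SQLite backups might be inspected with Python's standard sqlite3 module. The file name used below is hypothetical; check the backup directory for the actual names.

    import sqlite3

    # Hypothetical backup file name; the real files live under
    # /cache/halld/home/gluex/rcdb_sqlite/ (see Mark's email for details).
    backup = "/cache/halld/home/gluex/rcdb_sqlite/rcdb.sqlite"

    con = sqlite3.connect(backup)
    cur = con.cursor()

    # List the tables in the backup as a quick sanity check.
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    for (name,) in cur.fetchall():
        print(name)

    con.close()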
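
The sketch below illustrates the general shape of the wrapper described in the second item. It is not Thomas's script: the workflow name, run script, and SWIF command-line options shown are assumptions, and the real swif syntax should be checked against the SWIF documentation before use.

    import subprocess

    # All of these values are assumptions for illustration only.
    WORKFLOW = "signal_mc_test"            # hypothetical SWIF workflow name
    GENR8_INPUT = "eta_pi0.input"          # .input file handed to genr8
    EVENTS_PER_JOB = 10000
    NJOBS = 10
    RUN_SCRIPT = "/path/to/run_mc_job.sh"  # hypothetical script that runs
                                           # genr8 -> hdgeant -> mcsmear -> hd_root

    def submit(job_id):
        """Register one simulation job with SWIF.

        The swif options below are a guess at the interface; consult
        'swif help add-job' for the real syntax.
        """
        cmd = [
            "swif", "add-job",
            "-workflow", WORKFLOW,
            "-name", "sim_%03d" % job_id,
            RUN_SCRIPT, GENR8_INPUT, str(EVENTS_PER_JOB), str(job_id),
        ]
        subprocess.check_call(cmd)

    if __name__ == "__main__":
        for i in range(NJOBS):
            submit(i)
        # Start the workflow once all jobs are registered (again, syntax assumed).
        subprocess.check_call(["swif", "run", "-workflow", WORKFLOW])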

Lustre system status

Kurt Strosahl, of JLab SciComp, dropped by to give us a report on the recent problems with the Lustre file system (https://en.wikipedia.org/wiki/Lustre_(file_system)). These problems have affected our work, cache, and volatile directories. Lustre aggregates multiple partitions on multiple RAID arrays ("block devices" or "Object Store Targets", OSTs) and presents users with a view of one large disk partition. Redundant metadata servers keep track of which files are where.

On New Year's Day, due to Infiniband problems, a fail-over from one metadata server to the other was initiated mistakenly. In the confusion, both servers tried to mount a few of the OSTs, corrupting the metadata for five of the 74. This was the first time a fail-over had occurred for a production system at JLab. Intel and SciComp have been working together to recover the metadata. The underlying files appear to be OK, but without the metadata they cannot be accessed. So far, metadata for four of the five OSTs has been repaired and it appears that their files have reappeared intact. This work has been going on for over two weeks now; there is no definite estimate of when the last OST will be recovered. Fail-over has been inhibited for now.

We asked Kurt about recent troubles with ifarm1102. That particular node has been having issues with its Infiniband interface and has now been removed from the rotation of ifarm machines.

Review of minutes from the last meeting

We went over the minutes from the December 21, 2016 meeting (all).

  • The problem with reading data with multiple threads turned out, indeed, to be due to corrupted data. There was an issue with the RAID arrays in the counting room.
  • Sean has had further discussions on how we handle HDDS XML files. There is a plan now.
  • Mark will ask about getting us an update on the OSG appliance.

Launches

Alex A. gave the report.

2016-10 offline monitoring ver02

Alex wanted to start this launch before the break, but calibrations were not ready. Instead, processing started the first week of January, and since there was not a lot of data in the run period, it finished in a week. The gxproj1 account was used. There were some minor problems with the post-processing scripts that have now been fixed.

We are waiting for new calibration results before starting ver03, perhaps sometime next week. An issue with the propagation-delay calibration for the TOF has now been resolved, and there are ongoing efforts on the BCAL and FCAL calibrations. The monitoring launch gives us a check on the quality of the calibrations for the entire running period.

2016-02 analysis launch ver05

This launch started before the break. Jobs are running with only six threads. Large variations in execution time and peak memory use have been observed. The cause has been traced to a few channels that require many photons (e.g., 3π0), which can generate huge numbers of combinations and stall progress on a single thread; the sketch below illustrates the scale of the problem. Several solutions were discussed, including rewriting parts of the analysis library and cutting off processing for events that generate too many combinations. In addition, the list of plugins may get trimmed in the future. This launch took the philosophy of running "everything" to see how many channels we can reasonably get through.
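
As a rough illustration (not the analysis-library code) of why high photon multiplicities are expensive, the sketch below counts the distinct ways of grouping neutral showers into a 3π0 candidate and shows the kind of combination cap that was discussed; the threshold value is purely illustrative.

    from itertools import combinations
    from math import comb

    def n_3pi0_groupings(n_showers):
        """Number of ways to build a 3pi0 candidate: C(n, 6) choices of six
        showers, times 15 ways to split six photons into three unordered
        pairs (6! / (2!^3 * 3!) = 15)."""
        return comb(n_showers, 6) * 15

    for n in (6, 8, 10, 12, 15):
        print(n, "showers ->", n_3pi0_groupings(n), "3pi0 combinations")

    # One mitigation discussed: skip events whose combinatorics exceed a cap.
    MAX_COMBOS = 100000  # illustrative threshold, not a real analysis cut

    def process_event(shower_energies):
        if n_3pi0_groupings(len(shower_energies)) > MAX_COMBOS:
            return None  # give up rather than stall a thread
        # ... otherwise loop over combinations(shower_energies, 6), etc.
        return sum(1 for _ in combinations(shower_energies, 6))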

Sim 1.2

Mark reported that the 50k jobs that have been submitted are going through very slowly. Processing started in the middle of the break and is only 20% done, and this batch is itself only 20% of the total we planned to simulate. The processing time is dominated by the generation of electromagnetic background independently for each event. After some discussion of the purpose of the resulting data set, we decided to re-launch the effort without generation of E&M background. The data should still be useful for studying efficiency and acceptance.
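
Turning off the electromagnetic background amounts to a change in the hdgeant control.in cards. The sketch below assumes, from the card names listed later in these minutes, that BGRATE sets the background rate and BGGATE the time window over which it is superimposed, and simply drops those cards from a control.in file; verify the card semantics against the annotated sample control.in before relying on this.

    import shutil

    def disable_em_background(control_in="control.in"):
        """Remove the BGRATE/BGGATE cards from a control.in file.

        Assumption: with these cards absent, hdgeant does not superimpose
        EM beam background on each event. Check the documented sample
        control.in to confirm.
        """
        shutil.copy(control_in, control_in + ".bak")  # keep a backup
        with open(control_in) as f:
            lines = f.readlines()
        with open(control_in, "w") as f:
            for line in lines:
                if line.split() and line.split()[0] in ("BGRATE", "BGGATE"):
                    continue  # drop the background cards entirely
                f.write(line)

    if __name__ == "__main__":
        disable_em_background()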

HDGeant/HDGeant4 Update

Richard gave us an update on the development effort.

  • He is doing a tag-by-tag comparison of the output from HDGeant ("G3" for our purposes) and HDGeant4 ("G4"), comparing both truth and hit information. For 90% of the discrepancies he finds it is the new G4 code that needs fixing, but the other 10% come from G3 errors, mostly in truth information that is not looked at as often.
  • To do the comparison he has developed a new tool, hddm-root, that creates a ROOT tree automatically, directly from an HDDM file. This allows quick histogramming of quantities for comparison; see the sketch after this list.
  • Detectors where agreement has been verified: CDC, FDC, BCAL, FCAL, TOF, tagger, pair spectrometer (coarse and fine), triplet polarimeter.
  • The triplet polarimeter simulation was adapted from code from Mike Dugger; it was originally implemented in G4, but has also been back-ported to G3.
  • To test the TPOL simulation, a new card has been introduced, GENB[?], that tracks beam photons down the beamline from the radiator. It has three modes: pre-coll, post-coll, and post-conv, which end tracking at the collimator, at the converter, and on through to the TPOL, respectively. The generated particle information can be written out in HDDM format and serve as input to either G3 or G4, just as for any of our other event generators.
  • The coherent bremsstrahlung generator has been implemented in G4 and compared to that of G3.
  • "Fake" tagger hits are now being generated in G4 in the same manner as was done in G3. Also a new tag RFTime[?] has been introduced. It is a single time that sets the "true" phase of the RF used in the simulation.
  • Other detectors implemented: the DIRC, the MWPC (for the CPP experiment), and for completeness the gas RICH, the gas Cerenkov, and the UPV.
  • The MCTRAJECTORY card has been implemented in G4 and its implementation in G3 fixed. This allows output of position information for particle birth, death, and/or points in between for primary and/or secondary particles in a variety of combinations of those items.
  • The following G3 cards have been implemented in G4. The secretary will refer the reader to the documentation in the sample control.in for definitions of most of these.
    • KINE
    • SCAT
    • TARGET
    • BGGATE
    • BGRATE
  • The following cards will not be implemented in G4:
    • CUTS
    • SWITCH
      • CUTS and SWITCH do not fit into the Geant4 design philosophy
    • GELHAD
      • photonuclear interactions are now provided natively in Geant4
  • The following cards are being implemented now:
    • HADR
      • The meaning in G4 has been modified to control turning on/off all hadronic interaction processes to save users the bother of doing so one by one
    • CKOV
    • LABS
    • NOSEC
    • AUTO
    • BFIELD_MAP
    • PSFIELD_MAP
    • SAVEHIT
    • SHOWERS_IN_COLLIMATOR
    • DRIFT_CLUSTERS
    • MULS
    • BREMS
    • COMPT
    • LOSS
    • PAIR
    • DECAY
    • DRAY
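
As a concrete (but hypothetical) example of the tag-by-tag comparison workflow mentioned above, the PyROOT sketch below overlays one quantity from two trees produced by hddm-root from G3 and G4 output. The file, tree, and branch names are placeholders; the real names come from the HDDM record structure that hddm-root converts into a ROOT tree.

    import ROOT

    # Placeholder file/tree/branch names for illustration only.
    f3 = ROOT.TFile.Open("hdgeant_g3.root")
    f4 = ROOT.TFile.Open("hdgeant_g4.root")
    t3 = f3.Get("fcalTruthShower")
    t4 = f4.Get("fcalTruthShower")

    c = ROOT.TCanvas("c", "G3 vs G4 comparison")

    # Histogram the same quantity from both trees with identical binning.
    t3.Draw("E >> hE3(100, 0.0, 4.0)")
    t4.Draw("E >> hE4(100, 0.0, 4.0)")

    hE3 = ROOT.gDirectory.Get("hE3")
    hE4 = ROOT.gDirectory.Get("hE4")
    hE3.SetLineColor(ROOT.kBlue)
    hE4.SetLineColor(ROOT.kRed)

    hE3.Draw("hist")
    hE4.Draw("hist same")
    c.SaveAs("g3_vs_g4_E.png")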

Beni will describe the scheme he implemented in G3 to preserve the identity of secondary particles and will transmit the description to Richard.

Performance remains an issue but is not an area of focus at this stage. Richard has seen a slow-down of a factor of 24 per thread going from G3 to G4. At this point G4 is generating two orders of magnitude more secondary particles, mostly neutrons, compared to G3. A simple kinetic-energy threshold adjustment did not make much of a difference.

Sean made a couple of comments:

  1. The problem Richard discovered with missing TDC hits in the BCAL has been traced to the generation of digi-hits for the BCAL. CCDB constants had to be adjusted to bring those hits back.
  2. Caution should be used with the current pair spectrometer field map called for in the CCDB; it is only a preliminary rough guess.

Mark needs to create a standard build of G4.

Richard requested that if folks have problems, questions, or suggestions, they log an issue on GitHub (https://github.com/rjones30/HDGeant4/issues).