Difference between revisions of "GlueX Software Meeting, July 21, 2020"

From GlueXWiki
(copied)
 
m (Compiler upgrade discussion)
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
GlueX Software Meeting<br>
 
GlueX Software Meeting<br>
Tuesday, July 7, 2020<br>
+
Tuesday, July 21, 2020<br>
 
3:00 pm EDT<br>
 
3:00 pm EDT<br>
 
BlueJeans: [http://www.bluejeans.com/968592007 968 592 007]
 
BlueJeans: [http://www.bluejeans.com/968592007 968 592 007]
Line 7: Line 7:
  
 
# Announcements
 
# Announcements
## [https://mailman.jlab.org/pipermail/halld-offline/2020-June/004104.html New version set with upgrade to Geant4: version_4.21.3.xml] (Mark I.)
+
## [https://mailman.jlab.org/pipermail/halld-offline/2020-July/008264.html New version set, version_4.23.1.xml]
## [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004107.html multi-threaded hdgeant4 now working] (Richard)
+
# Review of [[GlueX Software Meeting, July 7, 2020#Minutes|Minutes from the Last Software Meeting]] (all)
## [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004108.html New release of Build Scripts: version 1.58] (Mark)
+
# [[HDGeant4_Meeting, July 14, 2020#Minutes|Report from the Last HDGeant4 Meeting]] (all)
## [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004109.html New version set: version_4.22.0.xml] (Mark)
+
## [https://halldweb.jlab.org/talks/2020/crc_md5.pdf Checksum changes for files written to tape] (Mark)
+
# Review of [[GlueX Software Meeting, June 9, 2020#Minutes|Minutes from the Last Software Meeting]] (all)
+
# Report from Recent HDGeant4 Meetings (all)
+
## [[HDGeant4_Meeting, June 16, 2020#Minutes|June 16 Meeting]]
+
## [[HDGeant4_Meeting, June 30, 2020#Minutes|June 30 Meeting]]
+
#  [https://docs.google.com/presentation/d/1bmy-IYamSmEtv4HsOeXHgceD66vVPstLZXymLNf4tE0/edit?usp=sharing Report from the SciComp Meeting] (Mark)
+
# [https://docs.google.com/presentation/d/1BGYUvKztfzSGDGwcfTrs6DoS_EUUvaeTHReeYFAEKvc/edit?usp=sharing NERSC status] (David)
+
# OSG Jobs and mcsmear (Thomas)
+
 
# [https://docs.google.com/presentation/d/1e1UpDpI0zc4pUe-lUsnij_10kZ2ItKv6ILvRrChv7GQ/edit?usp=sharing Compiler upgrade discussion] (all)
 
# [https://docs.google.com/presentation/d/1e1UpDpI0zc4pUe-lUsnij_10kZ2ItKv6ILvRrChv7GQ/edit?usp=sharing Compiler upgrade discussion] (all)
 
# Review of recent issues and pull requests:
 
# Review of recent issues and pull requests:
Line 38: Line 29:
 
== Minutes ==
 
== Minutes ==
  
Present: Alex Austregesilo, Thomas Britton, Sean Dobbs, Mark Ito (chair), Igal Jaegle, Richard Jones, Naomi Jarvis, David Lawrence, Keigo Mizutani, Susan Schadmand, Simon Taylor, Nilanga Wickramaarachchi, Beni Zihlmann
+
Present: Alex Austregesilo, Thomas Britton, Sean Dobbs, Mark Ito (chair), Richard Jones, Naomi Jarvis, David Lawrence, Susan Schadmand, Simon Taylor, Nilanga Wickramaarachchi, Beni Zihlmann
  
There is [https://bluejeans.com/s/MR_xjcUQ0sR/ a recording of this meeting] on the BlueJeans site. Use your JLab credentials to authenticate.
+
There is [https://bluejeans.com/s/0rIPo9NbCrd/ a recording of this meeting] on the BlueJeans site. Use your JLab credentials to authenticate.
  
 
=== Announcements ===
 
=== Announcements ===
  
# [https://mailman.jlab.org/pipermail/halld-offline/2020-June/004104.html New version set with upgrade to Geant4: version_4.21.3.xml] Ready for testers of an updated version of Geant4.
+
[https://mailman.jlab.org/pipermail/halld-offline/2020-July/008264.html New version set, version_4.23.1.xml]. The latest version set came out last Wednesday.
# [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004107.html multi-threaded hdgeant4 now working] Richard implemented patches to Geant4 that fixed issues that prevented multi-threaded running from giving sensible results.
+
#* To address thread safety in the HDGeant4 code proper, he made a change to provide each thread with its own copy of the magnetic field map at the cost of 300 MB of memory per additional thread. This was done out of an abundance of caution; if not necessary he will back out the change to recover the memory used.
+
#* Richard proposed changing the CCDB so that the magnetic field is no longer suppressed inside selected volumes, in particular in the BCAL. Other volumes that would be changed are the FCAL, TOF, DIRC, and CCAL, where the effect was not manifest because the field is so much lower there. He discovered that the suppression resulted in field discontinuities that caused the crashes in particle propagation reported by Naomi. The suppression had originally been introduced to speed up shower development in GEANT, but Geant4 seems not to be affected by having the field on. We endorsed the proposal.
+
# [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004108.html New release of Build Scripts: version 1.58] This new release will accommodate our current default as well as subsequent versions of Geant4.
+
# [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004109.html New version set: version_4.22.0.xml]. This version set incorporates the fix to multi-threaded running of HDGeant4 mentioned above.
+
# [https://halldweb.jlab.org/talks/2020/crc_md5.pdf Checksum changes for files written to tape] The Computer Center is dropping MD5 checksum and will rely on CRC32 sums for tape data validation.
+
  
 
=== Review of Minutes from the Last Software Meeting ===
 
=== Review of Minutes from the Last Software Meeting ===
  
We went over [[GlueX Software Meeting, June 9, 2020#Minutes|the minutes from June 9]].
+
We went over [[GlueX Software Meeting, July 7, 2020#Minutes|the minutes from the meeting on July 7]].
  
==== New Release of JANA: version 0.8.2 ====
+
==== NERSC Status ====
  
David has addressed the request to suppress warnings when geometry paths are requested but not present. This is often a normal situation, e.g., when probing the geometry to see how the data should be analyzed. His fix is in a new release of JANA, version 0.8.2.
+
David gave us the run-down on the preparations for the next reconstruction launch at NERSC.
  
==== Jana 2.0 ====
+
* He has gone to a roster of plugins intermediate between the original 57 and the minimum used in REST production.
 +
* Igal Jaegle has looked at the latest round of monitoring histograms. They look good. Any missing plots were due to dropped plugins.
 +
* Overall, the production system is ready to go.
 +
* David submitted 1,000 jobs recently. These represent a complete run. There were some SWIF2 issues that needed attention from Chris Larrieu.
 +
* The main outstanding issue is creating a complete set of fiducial times in the CCDB.
 +
** The slope of event time vs. the 250 MHz clock seems fine when the nominal 250 MHz is assumed. The offsets need to be filled in.
 +
** Beam trip information is also missing from the CCDB. It is presently missing for about 1 in 7 runs.
 +
** There is an issue with SWIF2 where only the first of the ten jobs associated with a single raw data file succeeds. The following nine require re-submission. We could live with this, but not happily. Chris is working on this.
  
Nathan Brei continues work on porting halld_recon to use JANA 2.0. We will get a report at a future meeting on this new major release.
+
==== Developer-Friendly Container Build ====
  
=== Report from Recent HDGeant4 Meetings ===
+
Mark made small adjustments to the rsync of our container software to Oasis that allow building of halld_recon against Oasis. See [https://mailman.jlab.org/pipermail/halld-offline/2020-July/008257.html his recent email] for more details.
  
We went over [[HDGeant4_Meeting, June 16, 2020#Minutes|the minutes from the June 16 meeting]] and [[HDGeant4_Meeting, June 30, 2020#Minutes|those from the June 30 meeting]] without much comment.
+
==== Corrupt CCDB SQLite Files ====
  
=== Report from the SciComp Meeting ===
+
Mark changed the limit on the output file-size check when SQLite versions of the CCDB are produced. It is unlikely that the "Lost connection to MySQL server" error will corrupt the file on Oasis anytime soon. If the file is not big enough, the previous version will not be replaced.
  
Please see [https://docs.google.com/presentation/d/1bmy-IYamSmEtv4HsOeXHgceD66vVPstLZXymLNf4tE0/edit?usp=sharing Mark's slides] for Scientific Computing news from the Computer Center.
+
=== Report from the Last HDGeant4 Meeting ===
  
=== NERSC status ===
+
# We went over [[HDGeant4_Meeting, July 14, 2020#Minutes|the minutes from the HDGeant4 Meeting on July 14]]. Thomas reported that he has seen the missing normal-error that Richard reported at that meeting. It seems to occur only for certain runs (simulations of specific real runs), but not always on the same event.
  
David brought us up-to-date on processing at NERSC. Please see [https://docs.google.com/presentation/d/1BGYUvKztfzSGDGwcfTrs6DoS_EUUvaeTHReeYFAEKvc/edit?usp=sharing his slides] for all of the details. Some broad points:
+
=== Compiler upgrade discussion ===
* We have moved to two-hour jobs rather than the 6 to 8 hour jobs run in past campaigns. This gives us much better access to the backfill mechanism at NERSC.
+
* The move involved significant development to our workflow to (a) process only selected parts of our 20 GB raw data files in a single job and (b) recombine the resulting output files to correspond to the original 20 GB file.
+
* One significant challenge has been to run monitoring launches, with their 57 plugins, to give complete ROOT output files for each selection of the raw data file.
+
* David pointed out that the current version of the Oasis image of our software does not support development (i.e., building new versions of software), only running. He has put up a Docker container that remedies this. Richard has also run into this issue; he has put the missing pieces in an undisclosed location on Oasis.
+
** Mark remarked that having a developer-friendly version of Oasis would involve very little work, only real estate on Oasis. He will look into this.
+
* Alex suggested that one way forward is to abandon processing of monitoring launches at NERSC and concentrate on REST file production which uses a smaller set of plugins, plugins that have had better past records of success.
+
  
=== OSG Jobs and mcsmear ===
+
Mark described the issues and possible paths forward for the problem of needing to adopt a more advanced, non-default compiler in order to bring in recent versions of third-party-provided software, such as Geant4 and ROOT. See [https://docs.google.com/presentation/d/1e1UpDpI0zc4pUe-lUsnij_10kZ2ItKv6ILvRrChv7GQ/edit?usp=sharing his slides] for the details (three main slides, large font, no plots).
  
Today, Thomas noticed that many jobs were crashing in mcsmear when accessing the SQLite form of the CCDB on the OSG. Two issues:
+
The proximate cause of the discussion is the possibility of upgrading to this year's version of Geant4, which requires GCC 4.9.3, more recent than the default 4.8.5 shipped with CentOS7. We decided on two concrete projects that move us in the right direction:
# What is wrong with the SQLite file?
+
 
# Why is it that mcsmear exits with status code = 0 (i.e., success) after bombing?
+
# Richard mentioned a new package, and has already written [[HOWTO_use_prebuilt_GlueX_software_from_any_linux_user_account_using_cvmfsexec|a HOWTO on cvmfsexec]], that will allow access via CVMFS to the Oasis share of our pre-built software stack from user space. This could greatly simplify the distribution of container-ready software.
We were only able to address the first issue. David noticed that today's SQLite file was a bit smaller than usual, making it suspect. Mark promised to recreate the SQLite file and ship it out via Oasis.
+
# Mark volunteered to build a container for CentOS8, which will use an advanced version of GCC natively. By using such a container, we are guaranteed that all system-supplied software is compatible with the new compiler.
[Added in press: (a) Mark [https://mailman.jlab.org/pipermail/halld-offline/2020-July/004115.html made good on his promise] and regenerated the SQLite file. (b) David had reported this issue via email to Mark early this morning. Suffice it to say that Mark is behind on his email.]
+
 
 +
Mark also showed a fourth slide with musings on how we might automate and improve tests of our software.
 +
 
 +
=== Review of recent issues and pull requests ===
 +
 
 +
We ran down [https://github.com/JeffersonLab/halld_recon/issues?q=is%3Aopen+is%3Aissue the list of halld_recon issues] without significant comment.
 +
 
 +
=== Review of recent discussion on the GlueX Software Help List ===
 +
 
 +
Naomi reminded us that if we see problems posting plots to the logbook, we should send a bug report to Mark Dalton with [https://groups.google.com/g/gluex-software/c/tDLG5qcStjA the info he has requested].

Latest revision as of 19:38, 21 July 2020

GlueX Software Meeting
Tuesday, July 21, 2020
3:00 pm EDT
BlueJeans: 968 592 007

Agenda

  1. Announcements
    1. New version set, version_4.23.1.xml
  2. Review of Minutes from the Last Software Meeting (all)
  3. Report from the Last HDGeant4 Meeting (all)
  4. Compiler upgrade discussion (all)
  5. Review of recent issues and pull requests:
    1. halld_recon
    2. halld_sim
    3. CCDB
    4. RCDB
  6. Review of recent discussion on the GlueX Software Help List (all)
  7. Action Item Review (all)

Minutes

Present: Alex Austregesilo, Thomas Britton, Sean Dobbs, Mark Ito (chair), Richard Jones, Naomi Jarvis, David Lawrence, Susan Schadmand, Simon Taylor, Nilanga Wickramaarachchi, Beni Zihlmann

There is a recording of this meeting on the BlueJeans site. Use your JLab credentials to authenticate.

Announcements

New version set, version_4.23.1.xml. The latest version set came out last Wednesday.

Review of Minutes from the Last Software Meeting

We went over the minutes from the meeting on July 7.

NERSC Status

David gave us the run-down on the preparations for the next reconstruction launch at NERSC.

  • He has gone to a roster of plugins intermediate between the original 57 and the minimum used in REST production.
  • Igal Jaegle has looked at the latest round of monitoring histograms. They look good. Any missing plots were due to dropped plugins.
  • Overall, the production system is ready to go.
  • David submitted 1,000 jobs recently. These represent a complete run. There were some SWIF2 issues that needed attention from Chris Larrieu.
  • The main outstanding issue is creating a complete set of fiducial times in the CCDB.
    • The slope of event time vs. the 250 MHz clock seems fine when the nominal 250 MHz is assumed. The offsets need to be filled in.
    • Beam trip information is also missing from the CCDB. It is presently missing for about 1 in 7 runs.
    • There is an issue with SWIF2 where only the first of the ten jobs associated with a single raw data file succeeds. The following nine require re-submission. We could live with this, but not happily. Chris is working on this.
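
The relation between event time and the 250 MHz clock mentioned in the list above is linear, so it can be illustrated with a minimal Python sketch. The function name `event_time_seconds` and the per-run `offset_s` parameter are hypothetical and are not taken from any GlueX software or CCDB table; only the nominal 250 MHz slope comes from the discussion.

```python
NOMINAL_CLOCK_HZ = 250e6  # nominal 250 MHz clock quoted in the minutes


def event_time_seconds(ticks, offset_s=0.0, clock_hz=NOMINAL_CLOCK_HZ):
    """Convert a clock tick count to seconds: t = offset + ticks / f.

    offset_s is a hypothetical per-run offset (the quantity that still
    needs to be filled in); clock_hz is the slope, here the nominal
    clock frequency.
    """
    return offset_s + ticks / clock_hz


if __name__ == "__main__":
    # 250 million ticks at the nominal frequency correspond to one second.
    print(event_time_seconds(250_000_000))
```

With the slope fixed at the nominal value, the only per-run unknown in this picture is the offset, which is consistent with the remark that the offsets are what remain to be filled in.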

Developer-Friendly Container Build

Mark made small adjustments to the rsync of our container software to Oasis that allow building of halld_recon against Oasis. See his recent email for more details.

Corrupt CCDB SQLite Files

Mark changed the limit on the output file-size check when SQLite versions of the CCDB are produced. It is unlikely that the "Lost connection to MySQL server" error will corrupt the file on Oasis anytime soon. If the file is not big enough, the previous version will not be replaced.
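
The idea of guarding a replacement with a file-size check can be sketched in Python as follows. This is an illustration only, not Mark's actual script: the function name `replace_if_big_enough`, the file paths, and the `min_bytes` threshold are all hypothetical.

```python
import os
import shutil


def replace_if_big_enough(new_path, current_path, min_bytes):
    """Install new_path over current_path only if it passes a size check.

    Returns True if the file was replaced and False if the previous
    version was kept. min_bytes is a hypothetical threshold, not the
    limit actually used in production.
    """
    if os.path.getsize(new_path) < min_bytes:
        # Suspiciously small output, e.g. from a dump that was cut short:
        # leave the previous version in place.
        return False
    shutil.copyfile(new_path, current_path)
    return True
```

A truncated dump then leaves the previously distributed file untouched, which is the behavior described above.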

Report from the Last HDGeant4 Meeting

  1. We went over the minutes from the HDGeant4 Meeting on July 14. Thomas reported that he has seen the missing normal-error that Richard reported at that meeting. It seems to occur only for certain runs (simulations of specific real runs), but not always on the same event.

Compiler upgrade discussion

Mark described the issues and possible paths forward for the problem of needing to adopt a more advanced, non-default compiler in order to bring in recent versions of third-party-provided software, such as Geant4 and ROOT. See his slides for the details (three main slides, large font, no plots).

The proximate cause of the discussion is the possibility of upgrading to this year's version of Geant4, which requires GCC 4.9.3, more recent than the default 4.8.5 shipped with CentOS7. We decided on two concrete projects that move us in the right direction:

  1. Richard mentioned a new package, and has already written a HOWTO on cvmfsexec, that will allow access via CVMFS to the Oasis share of our pre-built software stack from user space. This could greatly simplify the distribution of container-ready software.
  2. Mark volunteered to build a container for CentOS8, which will use an advanced version of GCC natively. By using such a container, we are guaranteed that all system-supplied software is compatible with the new compiler.
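
The compiler-version requirement quoted above amounts to a comparison of dotted version numbers. A minimal Python sketch follows, with hypothetical helper names `parse_version` and `meets_minimum`; it is not part of the GlueX build scripts.

```python
def parse_version(text):
    """Turn a dotted version string such as '4.8.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required):
    """Return True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    # Version numbers quoted in the minutes: CentOS7 default vs. requirement.
    print(meets_minimum("4.8.5", "4.9.3"))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as treating "4.10" as older than "4.9".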

Mark also showed a fourth slide with musings on how we might automate and improve tests of our software.

Review of recent issues and pull requests

We ran down the list of halld_recon issues without significant comment.

Review of recent discussion on the GlueX Software Help List

Naomi reminded us that if we see problems posting plots to the logbook, we should send a bug report to Mark Dalton with the info he has requested.