Difference between revisions of "GlueX Software Meeting, August 4, 2020"

From GlueXWiki
(Created page with "GlueX Software Meeting<br> Tuesday, August 4, 2020<br> 3:00 pm EDT<br> BlueJeans: [http://www.bluejeans.com/968592007 968 592 007] ==Agenda== # Announcements # Review of G...")
 
(minutes added)
 
(4 intermediate revisions by the same user not shown)
Line 7: Line 7:
  
 
# Announcements
 
# Announcements
 +
## [https://mailman.jlab.org/pipermail/halld-offline/2020-July/008283.html Draft of DSelector documentation] (Sean)
 
# Review of [[GlueX Software Meeting, July 21, 2020#Minutes|Minutes from the Last Software Meeting]] (all)
 
# Review of [[GlueX Software Meeting, July 21, 2020#Minutes|Minutes from the Last Software Meeting]] (all)
 
<!-- # [[HDGeant4_Meeting, July 14, 2020#Minutes|Report from the Last HDGeant4 Meeting]] (all) -->
 
<!-- # [[HDGeant4_Meeting, July 14, 2020#Minutes|Report from the Last HDGeant4 Meeting]] (all) -->
# Restoration of Execution Tests for Pull Request Builds (Sean)
+
# [https://mailman.jlab.org/pipermail/halld-offline/2020-July/008272.html Restoration of Execution Tests for Pull Request Builds] (Sean)
 
# Sluggish Response on halldweb.jlab.org (all)
 
# Sluggish Response on halldweb.jlab.org (all)
# Python 3 Compatible Build System (Mark)
+
# [https://docs.google.com/presentation/d/16L5cOT_Eh3a0H11I9oQvBKPFmkL46x-7-4NHswDglrw/edit?usp=sharing Python 3 Compatible Build System] (Mark)
 
# dE/dx theta Correction (Naomi)
 
# dE/dx theta Correction (Naomi)
 
# Review of recent issues and pull requests:
 
# Review of recent issues and pull requests:
Line 31: Line 32:
 
== Minutes ==
 
== Minutes ==
  
Present: Alex Austregesilo, Thomas Britton, Sean Dobbs, Mark Ito (chair), Richard Jones, Naomi Jarvis, David Lawrence, Susan Schadmand, Simon Taylor, Nilanga Wickramaarachchi, Beni Zihlmann
+
Present: Alex Austregesilo, Thomas Britton, Mark Ito (chair), Richard Jones, Naomi Jarvis, David Lawrence, Susan Schadmand, Beni Zihlmann
  
There is [https://bluejeans.com/s/0rIPo9NbCrd/ a recording of this meeting] on the BlueJeans site. Use your JLab credentials to authenticate.
+
There is [https://bluejeans.com/s/NwcD67rD2@l/ a recording of this meeting] on the BlueJeans site. Use your JLab credentials to authenticate.
  
 
=== Announcements ===
 
=== Announcements ===
  
[https://mailman.jlab.org/pipermail/halld-offline/2020-July/008264.html New version set, version_4.23.1.xml]. The latest version set came out last Wednesday.
+
[https://mailman.jlab.org/pipermail/halld-offline/2020-July/008283.html Draft of DSelector documentation]. See Beni for the link to edit the Overleaf document.
  
 
=== Review of Minutes from the Last Software Meeting ===
 
=== Review of Minutes from the Last Software Meeting ===
  
We went over [[GlueX Software Meeting, July 7, 2020#Minutes|the minutes from the meeting on July 7]].
+
We went over [[GlueX Software Meeting, July 21, 2020#Minutes|the minutes from July 21]].
  
==== NERSC Status ====
+
==== Corrupt CCDB SQLite Files ====
  
David gave us the run-down on the preparations for the next reconstruction launch at NERSC.
+
Mark reported that instances of corrupt CCDB SQLite files have occurred several times over the past two weeks. Recall that the error is  "Lost connection to MySQL server." The new size requirement has been catching them and bad ones are not getting shipped to Oasis.
  
* He has gone to a roster of plugins smaller than the original 57, but larger than the minimum used in REST production.
+
==== Compiler upgrade discussion ====
* Igal Jaegle has looked at the latest round of monitoring histograms. They look good. Any missing plots were due to dropped plugins.
+
* Overall, the production system is ready to go.
+
* David submitted 1,000 jobs recently. These represent a complete run. There were some SWIF2 issues that needed attention from Chris Larrieu.
+
* The main outstanding issue is creating a complete set of fiducial times in the CCDB.
+
** The slope of event time vs. the 250 MHz clock seems fine taking the nominal 250 MHz. The offsets need to be filled in.
+
** Beam trip information is also missing from the CCDB. About 1 in 7 runs is missing presently.
+
** There is an issue with SWIF2 where only the first of the ten jobs associated with a single raw data file succeeds. The following nine require re-submission. We could live with this, but not happily. Chris is working on this.
+
  
==== Developer-Friendly Container Build ====
+
Mark reported that [[HOWTO use prebuilt GlueX software from any linux user account using cvmfsexec|Richard's HOWTO]] on installing and running [https://github.com/cvmfs/cvmfsexec cvmfsexec] allowed him to easily install Oasis on Mark's RHEL7 box at the Lab. This, coupled with our standard container, allows almost instant access to a JLab-like development/running environment, as advertised. With a CentOS 8 container, which is in the works, it could ease our transition to a more advanced version of GCC.
  
Mark made small adjustments to the rsync of our container software to Oasis that allows building of halld_recon against Oasis. See [https://mailman.jlab.org/pipermail/halld-offline/2020-July/008257.html his recent email] for more details.
+
=== Restoration of Execution Tests for Pull Request Builds ===
  
==== Corrupt CCDB SQLite Files ====
+
[https://mailman.jlab.org/pipermail/halld-offline/2020-July/008272.html This] is working now.
 +
Sean Dobbs may have more to say at the next meeting. Mark pointed out that there is a new environment set-up scheme to ensure consistency between building and running the binary tests.
  
Mark changed the limit on the output file-size check when SQLite versions of the CCDB are produced. It is unlikely that the "Lost connection to MySQL server" will corrupt the file on Oasis anytime soon. If the file is not big enough, the previous version will not be replaced.
+
=== Sluggish Response on halldweb.jlab.org ===
  
=== Report from the Last HDGeant4 Meeting ===
+
Several people have been noticing periods of slow response from our main webserver, halldweb.jlab.org, including, but not limited to, use of the wikis. Yesterday morning the server was timing out on requests, not good at all. Mark reported that during slow-downs, the webserver has plenty of idle CPU cycles and does not appear to be swapping. The majority of web requests during these times are from the MCwrapper Dashboard, at the level of a few Hertz from multiple browser clients. Thomas assured us that those operations are lightweight and cannot account for the slow-downs. Going forward:
  
# We went over [[HDGeant4_Meeting, July 14, 2020#Minutes|the minutes from the HDGeant4 Meeting on July 7]]. Thomas reported that he has seen the missing normal-error that Richard reported at that meeting. It seems to only occur for certain runs (simulations of specific real runs), but not always on the same event.
+
* At Mark's request, Thomas has increased the period between update requests from the browser application, despite his assertion that those requests cannot possibly be the problem.
 +
* Naomi suggested that people submit ServiceNow requests (write an email to helpdesk@jlab.org) whenever a problem is encountered. That might raise the visibility of the issue.
 +
* Mark mentioned the possibility of putting up a dedicated server, either a webserver, database server, or both, to move the load away from other essential functions on halldweb.
 +
* Mark also suggested that the Computer Center implement some sort of history mechanism that might help identify the bottleneck.
  
=== Compiler upgrade discussion ===
+
=== dE/dx theta Correction ===
  
Mark described the issues and possible paths forward for the problem of needing to adopt a more advanced, non-default compiler in order to bring in recent versions of third-party-provided software, such as Geant4 and ROOT. See [https://docs.google.com/presentation/d/1e1UpDpI0zc4pUe-lUsnij_10kZ2ItKv6ILvRrChv7GQ/edit?usp=sharing his slides] for the details (three main slides, large font, no plots).
+
Naomi reported that her improvements to the CDC dE/dx measurement located on [https://github.com/JeffersonLab/halld_recon/tree/nsj_dedx_theta_correction this branch] are ready to go. She expressed her concern that the dE/dx quantities would undergo a sudden change if this branch were merged, making those quantities inconsistent with those encoded in REST files from previous reconstruction launches. Mark said that similar improvements are merged all the time. Beni gave a strong suggestion that the pull request be composed. [Added in press: Naomi submitted the pull request and Beni merged it with the comment "too good to not have."]
  
The proximate cause of the discussion is the possibility of upgrading to this year's version of Geant4, which requires GCC 4.9.3, more recent than the default 4.8.5 shipped with CentOS7. We decided on two concrete projects that move us in the right direction:
+
=== Python 3 Compatible Build System ===
  
# Richard mentioned a new package, and has already written [[HOWTO_use_prebuilt_GlueX_software_from_any_linux_user_account_using_cvmfsexec|a HOWTO on cvmfsexec]], that will allow access via CVMFS to the Oasis share of our pre-built software stack from user space. This could greatly simplify the distribution of container-ready software.
+
Mark described the changes, [https://mailman.jlab.org/pipermail/halld-offline/2020-July/008279.html announced last week], that allow us to build our software on either a Python-2-based system or on one based on Python 3. See [https://docs.google.com/presentation/d/16L5cOT_Eh3a0H11I9oQvBKPFmkL46x-7-4NHswDglrw/edit?usp=sharing his slides] for details. This work is a first step toward developing a container system for CentOS 8.
# Mark volunteered to build a container for CentOS8, which will use an advanced version of GCC natively. By using such a container, we are guaranteed that all system-supplied software is compatible with the new compiler.
+
 
+
Mark also showed a fourth slide with musings on how we might automate and improve tests of our software.
+
  
 
=== Review of recent issues and pull requests ===
 
=== Review of recent issues and pull requests ===
  
We ran down [https://github.com/JeffersonLab/halld_recon/issues?q=is%3Aopen+is%3Aissue the list of halld_recon issues] without significant comment.
+
David called our attention to halld_recon Issue #418, '''hd_root hangs at the end of evio file with is_valid_run_end = false''', originally submitted by Naomi. Richard will have a look.
  
 
=== Review of recent discussion on the GlueX Software Help List ===
 
=== Review of recent discussion on the GlueX Software Help List ===
  
Naomi reminded us that if we see problems posting plots to the logbook, we should send a bug report to Mark Dalton with [https://groups.google.com/g/gluex-software/c/tDLG5qcStjA the info he has requested].
+
We went over two items:
 +
 
 +
* [https://groups.google.com/g/gluex-software/c/tDLG5qcStjA JLab logbook image upload problem]. We heard that Mark Dalton has received enough feedback to proceed.
 +
* [https://groups.google.com/g/gluex-software/c/FFKZCBWIAgI Simulation stuck at first event]. Igal Jaegle is still having this problem. Richard will take a look.

Latest revision as of 19:21, 4 August 2020

GlueX Software Meeting
Tuesday, August 4, 2020
3:00 pm EDT
BlueJeans: 968 592 007

Agenda

  1. Announcements
    1. Draft of DSelector documentation (Sean)
  2. Review of Minutes from the Last Software Meeting (all)
  3. Restoration of Execution Tests for Pull Request Builds (Sean)
  4. Sluggish Response on halldweb.jlab.org (all)
  5. Python 3 Compatible Build System (Mark)
  6. dE/dx theta Correction (Naomi)
  7. Review of recent issues and pull requests:
    1. halld_recon
    2. halld_sim
    3. CCDB
    4. RCDB
  8. Review of recent discussion on the GlueX Software Help List (all)
  9. Action Item Review (all)

Minutes

Present: Alex Austregesilo, Thomas Britton, Mark Ito (chair), Richard Jones, Naomi Jarvis, David Lawrence, Susan Schadmand, Beni Zihlmann

There is a recording of this meeting on the BlueJeans site. Use your JLab credentials to authenticate.

Announcements

Draft of DSelector documentation. See Beni for the link to edit the Overleaf document.

Review of Minutes from the Last Software Meeting

We went over the minutes from July 21.

Corrupt CCDB SQLite Files

Mark reported that instances of corrupt CCDB SQLite files have occurred several times over the past two weeks. Recall that the error is "Lost connection to MySQL server." The new size requirement has been catching them and bad ones are not getting shipped to Oasis.
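
For illustration only, a size gate of the kind described above might look like the following Python sketch. The function name publish_if_large_enough, the file names, and the threshold are hypothetical; the actual script and its size limit are not given in these minutes.

```python
import os
import shutil


def publish_if_large_enough(candidate, published, min_bytes):
    """Replace `published` with `candidate` only if it passes a size check.

    Returns True if the published file was replaced, False otherwise.
    All names and the threshold are hypothetical, for illustration only.
    """
    # A dump interrupted part-way (for example by a lost database
    # connection) tends to be truncated, so a small file is suspect.
    if not os.path.isfile(candidate):
        return False
    if os.path.getsize(candidate) < min_bytes:
        return False
    # Copy to a temporary name beside the target, then rename over it,
    # so readers never see a partially written file.
    tmp = published + ".tmp"
    shutil.copyfile(candidate, tmp)
    os.replace(tmp, published)
    return True
```

With a check like this, a failed dump leaves the previously published file in place, which matches the behavior reported: bad files are caught and are not shipped.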

Compiler upgrade discussion

Mark reported that Richard's HOWTO on installing and running cvmfsexec allowed him to easily install Oasis on Mark's RHEL7 box at the Lab. This, coupled with our standard container, allows almost instant access to a JLab-like development/running environment, as advertised. With a CentOS 8 container, which is in the works, it could ease our transition to a more advanced version of GCC.

Restoration of Execution Tests for Pull Request Builds

This is working now. Sean Dobbs may have more to say at the next meeting. Mark pointed out that there is a new environment set-up scheme to ensure consistency between building and running the binary tests.

Sluggish Response on halldweb.jlab.org

Several people have been noticing periods of slow response from our main webserver, halldweb.jlab.org, including, but not limited to, use of the wikis. Yesterday morning the server was timing out on requests, not good at all. Mark reported that during slow-downs, the webserver has plenty of idle CPU cycles and does not appear to be swapping. The majority of web requests during these times are from the MCwrapper Dashboard, at the level of a few Hertz from multiple browser clients. Thomas assured us that those operations are lightweight and cannot account for the slow-downs. Going forward:

  • At Mark's request, Thomas has increased the period between update requests from the browser application, despite his assertion that those requests cannot possibly be the problem.
  • Naomi suggested that people submit ServiceNow requests (write an email to helpdesk@jlab.org) whenever a problem is encountered. That might raise the visibility of the issue.
  • Mark mentioned the possibility of putting up a dedicated server, either a webserver, database server, or both, to move the load away from other essential functions on halldweb.
  • Mark also suggested that the Computer Center implement some sort of history mechanism that might help identify the bottleneck.

dE/dx theta Correction

Naomi reported that her improvements to the CDC dE/dx measurement located on this branch are ready to go. She expressed her concern that the dE/dx quantities would undergo a sudden change if this branch were merged, making those quantities inconsistent with those encoded in REST files from previous reconstruction launches. Mark said that similar improvements are merged all the time. Beni gave a strong suggestion that the pull request be composed. [Added in press: Naomi submitted the pull request and Beni merged it with the comment "too good to not have."]

Python 3 Compatible Build System

Mark described the changes, announced last week, that allow us to build our software on either a Python-2-based system or on one based on Python 3. See his slides for details. This work is a first step toward developing a container system for CentOS 8.
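
As a generic illustration of what running under both interpreters involves, the sketch below shows a few common idioms that behave identically on Python 2 and Python 3. It is not taken from the build scripts or from Mark's slides; the function names and the example settings are hypothetical.

```python
from __future__ import division, print_function

import sys


def describe_interpreter():
    """Return a short label such as 'python3' for the running interpreter."""
    return "python%d" % sys.version_info[0]


def format_env(env):
    """Render a dict of settings as sorted 'KEY=value' lines.

    sorted(env.items()) works the same on Python 2 and 3, unlike
    env.iteritems(), which exists only on Python 2.
    """
    return ["%s=%s" % (key, value) for key, value in sorted(env.items())]


def split_jobs(n_files, n_workers):
    """Number of files per worker, rounded up to an integer.

    '//' is floor division on both Python 2 and 3. Without the
    __future__ import, a bare '/' on two ints would give an int on
    Python 2 and a float on Python 3.
    """
    return (n_files + n_workers - 1) // n_workers
```

Calling print(describe_interpreter()) works under either interpreter because print_function makes print a function on Python 2 as well.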

Review of recent issues and pull requests

David called our attention to halld_recon Issue #418, hd_root hangs at the end of evio file with is_valid_run_end = false, originally submitted by Naomi. Richard will have a look.

Review of recent discussion on the GlueX Software Help List

We went over two items: