GlueX Offline Meeting, January 21, 2015
GlueX Offline Software Meeting
Wednesday, January 14, 2015
1:30 pm EST
JLab: CEBAF Center F326/327
Agenda
- Announcements
- Volatile disk expanded: reservation 10 -> 20 TB, quota 30 -> TB
- Marty Wise working on Run Conditions (Control?) Database (RCDB)
- Computer Center has RHEL7 available for beta testers
- Work disk full
- Review of minutes from January 7 (all)
- Data Challenge 3
- Software Review Preparations
- Commissioning Run Review:
- Offline Monitoring Report (Kei)
- Ran over all files (online plugins, 2-track EVIO skim, REST) 2 weeks ago
- Next launch is this Friday
- Will be testing EventStore to mark events
- Quick update on CentOS65, multithread processing
- Commissioning-branch-to-trunk migration (Simon)
- Handling changing magnetic field settings (Sean)
- Analysis of REST file data (Justin)
- Data Management (Sean)
- Storing software information in REST files
- EVIO format definition for Level 3 trigger farm
- EventStore: implementation plan
- Offline Monitoring Report (Kei)
- Requests to SciComp on farm features (Kei)
- Tools to track jobs:
- tools to track what percentage of nodes were being used by whom at a given time, preferably in both number of jobs and threads. We can see the pie charts, for example, at http://scicomp.jlab.org/scicomp/#/auger/usage but would like the information in a form that we can easily access and analyze.
- what % of nodes are currently available for each OS at a given time
- tools to track the lifetime of each stage of the job, such as sitting in queue, waiting for files from tape, running, etc.
- Would it be possible to make the stdout and stderr web-viewable?
- If possible, can you add the ability to search by “job name” (every job that includes the search term) to the Auger custom job query website?
- For more general requests:
- better transparency for whether there are problems in the system, such as heavy traffic due to users, broken disks, etc. Could there be an email list/webpage for that information?
- clarification of how 'priority' of jobs works between different halls and users.
- would it be possible for the system to auto-resubmit failed jobs if the failure is on the side of the system (e.g., bad farm nodes, temporary loss of connection)?
- Additionally, ask for more space on cache disk?
- HDDM versions and backward compatibility
- Action Item Review
Communication Information
Remote Connection
- The BlueJeans meeting number is 968 592 007.
- Join the Meeting via BlueJeans
Slides
Talks can be deposited in the directory /group/halld/www/halldweb1/html/talks/2015
on the JLab CUE. This directory is accessible from the web at https://halldweb1.jlab.org/talks/2015/ .