Offline Monitoring Archived Data


Overview

Once files are written to tape we run the online plugins on these files to confirm what we were seeing in the online monitoring, and also to update the results from the latest calibration and software. Manual scripts and cron jobs are set up to look for new data and run the plugins over a sample of files.

Every other Friday (usually the Friday before the offline meetings) jobs will be started to run the newest software on all previous runs, allowing everybody to see improvements in each detector. For each launch, independent builds of hdds, sim-recon, and the monitoring plugins will be made, and a new sqlite file will be generated.

Below the procedures are described for

  1. Preparing the software for the launch
  2. Starting the launch (using hdswif)
  3. Post-analysis of statistics of the launch

Processing the results and making them available to the collaboration is handled in the section Post-Processing Procedures below.

General Information on Procedures

Since we may want to simultaneously run offline monitoring for different run periods that require different environment variables, the scripts are set up so that a generic user can download the scripts and run them from anywhere. Most output directories for offline monitoring are created with group read/write permissions so that any Hall D group user has access to the contents, but there are some cases where use of the account that created the launch is necessary.

The accounts used for offline monitoring are the gxprojN accounts created and maintained by Mark Ito (see here for how each account is used). As of October 2015, the following are used:

  • gxproj1 for running over incoming experimental data (as it hits the tape)
  • gxproj5 for running over previous experimental data (biweekly launches)

For offline monitoring, the hdswif system that Kei developed is used for launching the jobs, and a new cross analysis system based on MySQL and Python is maintained.

The scripts for the monitoring are maintained in svn:

https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/

Preparing the software for the launch

1. Update the environment, using the latest desired versions of JANA, the CCDB, etc. Also, the launch software will create new tags of the HDDS and sim-recon repositories, so update the version*.xml file referenced in the environment file to use the soon-to-be-created tags. This must be done BEFORE launch project creation. The environment file is at:

~/env_monitoring_launch
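To confirm which version*.xml file the environment actually points to before creating the launch, a quick check such as the following can help (a minimal sketch; the exact variable names inside env_monitoring_launch may differ):

grep -i "version" ~/env_monitoring_launch    # show which version*.xml file the environment references
less [version xml file]                      # confirm it uses the soon-to-be-created offmon tags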
2. Set up the environment. This will override the HDDS and sim-recon versions in the version*.xml file and will instead use the monitoring launch working-area builds. Call:
source ~/env_monitoring_launch

3. Updating & building hdds:

cd $HDDS_HOME
git pull                # Get latest software
scons -c install        # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers
scons install -j4       # Rebuild and re-install with 4 threads

4. Updating & building sim-recon:

cd $HALLD_HOME/src
git pull                # Get latest software
scons -c install        # Clean out the old install: EXTREMELY IMPORTANT for cleaning out stale headers
scons install -j4       # Rebuild and re-install with 4 threads

5. Create a new sqlite file containing the very latest calibration constants. Original documentation on creating sqlite files is here.

cd $GLUEX_MYTOP/../sqlite/
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhallddb.jlab.org -uccdb_user ccdb | sqlite3 ccdb.sqlite
mv ccdb.sqlite ccdb_monitoring_launch.sqlite #replacing the old file
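A quick sanity check of the new file can catch a failed or truncated conversion before it is used in the launch (a minimal sketch; assumes sqlite3 is available on the path):

ls -lh ccdb_monitoring_launch.sqlite             # size should be comparable to the previous sqlite file
sqlite3 ccdb_monitoring_launch.sqlite ".tables"  # should list the CCDB tables without errors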

Starting a new run period

When a new run period is started, it is best to make sure all top-level directories are created with the right permissions. This can save headaches later on when a different gxprojN account is used for offline monitoring.

  1. Create top-level directories
     /volatile/halld/offline_monitoring/RunPeriod-20YY-MM/
     /group/halld/data_monitoring/run_conditions/RunPeriod-20YY-MM/
  2. Make sure other gxprojN users can write to these directories with chmod g+w [dir name]. Check that the permissions match those of previous run periods (see the sketch after this list).
  3. The publish-to-web scripts update existing webpages, so pre-existing template versions of a few files are needed for the new run period. The scripts also expect to find comment hooks that include the new run period name, so make sure those are edited in the new template files.
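A minimal sketch of this setup, using a hypothetical run period RunPeriod-2016-02 (substitute the actual run period):

mkdir -p /volatile/halld/offline_monitoring/RunPeriod-2016-02
mkdir -p /group/halld/data_monitoring/run_conditions/RunPeriod-2016-02
chmod g+w /volatile/halld/offline_monitoring/RunPeriod-2016-02             # let the other gxprojN accounts write here
chmod g+w /group/halld/data_monitoring/run_conditions/RunPeriod-2016-02
ls -ld /volatile/halld/offline_monitoring/RunPeriod-*                      # compare permissions with previous run periods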

Starting the Launch and Submitting Jobs

1. Download the hdswif directory from svn. For the gxprojN accounts, use the directory ~/monitoring/hdswif.

svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif
cd hdswif

2. Edit the job config file, input.config, which is used to register jobs in hdswif. A typical config file will look like this:

PROJECT                       gluex
TRACK                         reconstruction
OS                            centos65
NCORES                        6
DISK                          40
RAM                           8
TIMELIMIT                     8
JOBNAMEBASE                   offmon_
RUNPERIOD                     2015-03
VERSION                       15
OUTPUT_TOPDIR                 /volatile/halld/offline_monitoring/RunPeriod-[RUNPERIOD]/ver[VERSION] # Other config variables can be referenced inside a value
SCRIPTFILE                    /home/gxproj5/monitoring/hdswif/script.sh                             # Must specify full path
ENVFILE                       /home/gxproj5/env_monitoring_launch                                   # Must specify full path

3. Creating the workflow: Within SWIF, jobs are registered into workflows. For offline monitoring, the workflow names are of the form offline_monitoring_RunPeriod20YY_MM_verVV_hd_rawdata with suitable replacements for the run period year YY, month MM, and the version number VV (with leading zeroes). The command "swif list" will list all existing workflows. hdswif also provides wrappers for most simple SWIF commands.

swif list

For creating workflows for offline monitoring, the command

hdswif.py create [workflow] -c input.config

should be used. When a config file (here, input.config) is passed in, hdswif will automatically create files that record the configuration of the current launch. These files are stored, for example, as:

/group/halld/data_monitoring/run_conditions/RunPeriod-2015-03/jana_rawdata_comm_2015_03_ver15.conf
/group/halld/data_monitoring/run_conditions/RunPeriod-2015-03/soft_comm_2015_03_ver15.xml
The software packages stored in git (sim-recon and hdds) can have git tags applied to them, which makes it easier to find a specific version of the software than using a SHA-1 hash. hdswif will ask if you would like to create a tag and, if so, execute the following sequence:
git tag -a offmon-201Y_MM-verVV -m "Used for offline monitoring 201Y-MM verVV started on 201y/mm/dd"
git push origin offmon-201Y_MM-verVV
This will only be invoked when the user name is gxprojN. For the configuration files, the output directory will be /group/halld/data_monitoring/run_conditions/ for gxprojN accounts, while it will be the current directory for other users.

4. Registering jobs in the workflow: To register jobs within the workflow, hdswif uses config files. Jobs can be registered by specifying the workflow and config file (-c), plus run (-r) and file (-f) numbers if necessary. Note: job configuration parameters can be set differently for jobs within the same workflow if necessary.

Jobs can be added via
hdswif.py add [workflow] -c input.config

By default, hdswif will add all files found within the directory /mss/halld/RunPeriod-201Y-MM/rawdata/ where 201Y-MM is specified by the RUNPERIOD parameter in the config file. If only some of the runs or files are needed, these can be specified for example with

hdswif.py add [workflow] -c input.config -r 3180 -f '00[0-4]'

to register jobs for only run 3180, files 000-004 (Unix-style brackets and wildcards can be used).
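If a handful of specific runs is needed instead, the same add command can simply be repeated, for example in a small shell loop (a sketch; the run numbers here are placeholders):

for run in 3180 3185 3190 ; do
    hdswif.py add [workflow] -c input.config -r $run
done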

5. Running the workflow: To run the workflow, simply use the hdswif wrapper:

hdswif.py run [workflow]

It is recommended to test a few jobs to make sure that everything is working, rather than have thousands of jobs fail.
Check the configuration used when creating the workflow, and what is in script.sh. Also check the following (a sketch for automating these checks is given after the list):

  • Check stderr files. Are they small (<kB)?
  • Check stdout files. Are they very large (>MB)?
  • Check output ROOT files. Are they larger than several MB?
  • Check output REST files. Are they larger than several tens of MB?
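A minimal sketch of these size checks on the test jobs, assuming the output lands under the launch's OUTPUT_TOPDIR and that stdout/stderr, ROOT, and REST files can be identified by their extensions (adjust paths, patterns, and thresholds to the actual layout):

TOPDIR=/volatile/halld/offline_monitoring/RunPeriod-2015-03/ver15   # OUTPUT_TOPDIR of the test jobs
find $TOPDIR -name '*.err'  -size +100k     # stderr files that are suspiciously large
find $TOPDIR -name '*.out'  -size -10k      # stdout files that are suspiciously small
find $TOPDIR -name '*.root' -size -1M       # ROOT files smaller than ~1 MB (expect several MB)
find $TOPDIR -name '*.hddm' -size -10M      # REST files smaller than ~10 MB (expect several tens of MB)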


To submit only a few test jobs for this purpose, the hdswif run command takes an additional parameter that limits the number of jobs to submit:
hdswif.py run [workflow] 10

in which case only 10 jobs will be submitted.

To submit all jobs after checking the results, do
hdswif.py run [workflow]

Checking the Status and Resubmitting

1. The status of jobs can be checked on the terminal with
jobstat -u gxprojN
For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use
swif list
or for more information,
swif status [workflow] -summary
Note that "swif status" tends to be out of date sometimes, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger job website. 2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission for failed jobs with the same resources,
swif retry-jobs [workflow] -problems [problem name]
can be used, and for jobs to be submitted with more resources, e.g., use
swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT

This only re-stages the jobs; be sure to resubmit them with:

swif run -workflow [workflow] -errorlimit none
hdswif has a wrapper for both of these:
hdswif.py resubmit [workflow] [problem]
In this case [problem] can be one of SYSTEM, TIMEOUT, or RLIMIT. If SYSTEM is specified, the jobs will be retried. For TIMEOUT and RLIMIT, the jobs will be modified to request 2 additional hours or 2 additional GB of RAM by default. If a number is added as an extra option, that many hours or GB of RAM will be added instead; e.g.,
hdswif.py resubmit [workflow] TIMEOUT 5
will add 5 hours of processing time. You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.
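A typical resubmission pass near the end of a launch might then look like the following sketch, using only the commands above (the problem types and extra hours are examples):

swif status [workflow] -summary                  # see how many jobs failed and why
hdswif.py resubmit [workflow] SYSTEM             # retry system-level failures with the same resources
hdswif.py resubmit [workflow] TIMEOUT 4          # re-stage timed-out jobs with 4 extra hours
swif run -workflow [workflow] -errorlimit none   # resubmit any re-staged jobs, if hdswif has not already done so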

3. For information on swif, use the "swif help" commands, and for hdswif see the documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf

4. Below is a table describing the various errors that can occur.

AUGER-SUBMIT
    Description: SWIF's attempt to submit jobs to Auger failed. Includes server-side problems as well as the user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)
    Resolution: If requested resources are known to be correct, resubmit. Otherwise modify job resources using swif directly.
    hdswif command: hdswif.py resubmit [workflow] SYSTEM

AUGER-FAILED
    Description: Auger reports the job FAILED with no specific details.
    Resolution: Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
    hdswif command: hdswif.py resubmit [workflow] SYSTEM

AUGER-OUTPUT-FAIL
    Description: Failure to copy one or more output files. Can be due to a permission problem, quota problem, system error, etc.
    Resolution: Check that output files will exist after job execution and that the output directory exists, then resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
    hdswif command: hdswif.py resubmit [workflow] SYSTEM

AUGER-INPUT-FAIL
    Description: Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if the tape file is unavailable (e.g. missing/damaged tape).
    Resolution: Check that the input file exists, then resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
    hdswif command: hdswif.py resubmit [workflow] SYSTEM

AUGER-TIMEOUT
    Description: Job timed out.
    Resolution: If more time is needed for the job, add more resources. Default is to add 2 hrs of processing time. Also check whether the code is hanging.
    hdswif command: hdswif.py resubmit [workflow] TIMEOUT (default is to add 2 hours; optionally specify the number of hours at the end)

AUGER-OVER_RLIMIT
    Description: Not enough resources (RAM or disk space).
    Resolution: Add more resources for the job.
    hdswif command: hdswif.py resubmit [workflow] RLIMIT (default is to add 2 GB of RAM; optionally specify the number of GB at the end. To add more disk space, use SWIF directly.)

SWIF-MISSING-OUTPUT
    Description: An output file specified by the user was not found.
    Resolution: Check that the output file exists at the end of the job.
    hdswif command: (none)

SWIF-USER-NON-ZERO
    Description: The user script exited with a non-zero status code.
    Resolution: Check the code you are running.
    hdswif command: (none)

SWIF-SYSTEM-ERROR
    Description: Job failed owing to a problem with swif (e.g. network connection timeout).
    Resolution: Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
    hdswif command: hdswif.py resubmit [workflow] SYSTEM


Post-analysis of statistics of the launch

After jobs have been submitted, it will usually take a few days for all of the jobs to be processed. The next step is to check the resource usage for the current launch and publish the results online.

  1. Create summary XML, HTML files
    The status and results of jobs are saved within the SWIF internal server, and are available via the command
    swif status [workflow] -summary -runs
    where the arguments -summary and -runs show summary statistics and statistics for individual jobs, respectively. hdswif has a command that takes this XML output and creates an HTML webpage showing the results of the launch. To do this, do
    hdswif.py summary [workflow]
    This will create the XML output file from SWIF called swif_output_[workflow].xml and create a webpage containing png figure files. If the XML file already exists, hdswif will ask whether to overwrite the existing file.
  2. Publish output files online
    At this stage the html output and figure files are created and ready to be put online. The html output capabilities of hdswif are useful for any user using SWIF, but since publication of the html output is specific to the offline monitoring, the script to do this is contained in the cross_analysis scripts directory at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. For the gxprojN accounts this directory should exist as ~/monitoring/cross_analysis. To publish the results online, do for example
    python ~/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18
    The script copies the html output and corresponding figures to /group/halld/www/halldweb/html/data_monitoring/launch_analysis/ and also creates a link to it in the html page.
  3. Editing the summary HTML page
    The top page for offline monitoring is https://halldweb.jlab.org/data_monitoring/launch_analysis/index.html and has links to the summary page for each run period. The summary files are
    /group/halld/www/halldweb/html/data_monitoring/launch_analysis/[run period]/[run period].html 
    Edit the file to:
    1. Add a new line to the first table which contains the version number, date, and comments for the current launch
    2. Create a link to the webpage for the current launch. Simply copy, paste, and modify the previous launch's link to use the correct launch version.
  4. Freezing SWIF tables
    Since we are now finished with the SWIF workflow that we used, the workflow should be "frozen" so that it cannot be mistakenly altered or modified. Do
    swif freeze [workflow]
  5. Backing up SWIF output
    With the workflow frozen we should be able to reproduce the XML output from SWIF, but we will back up the XML output just in case. Do
    cp ~/monitoring/hdswif/xml/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/ 
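Taken together, the end-of-launch bookkeeping amounts to roughly the following sequence (a sketch using the commands above, with run period 2015_03 and version 18 as examples; editing the summary HTML page in step 3 remains a manual edit):

hdswif.py summary [workflow]                                                                          # create swif_output_[workflow].xml and the launch webpage
python ~/monitoring/cross_analysis/publish_offmon_results.py 2015_03 18                              # copy the html output and figures to the web area
swif freeze [workflow]                                                                                # freeze the finished workflow
cp ~/monitoring/hdswif/xml/swif_output_[workflow].xml /group/halld/data_monitoring/swif_xml_backup/   # back up the XML output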

Cross Analysis of Launches

The purpose of the cross analysis is to correlate how resource usage changed for the same files across different launches. To do this it is useful to create MySQL tables that contain information on each launch, and do queries across tables.
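As an illustration of the kind of cross-launch query this enables, the sketch below compares CPU time per file between two launches. The table names and connection options are hypothetical placeholders; the column names follow the table schema described in the steps below:

mysql [connection options] -e "
    SELECT a.run, a.file, a.cpu_sec AS cpu_ver15, b.cpu_sec AS cpu_ver22
    FROM cross_analysis_2015_03_ver15 a
    JOIN cross_analysis_2015_03_ver22 b ON a.run = b.run AND a.file = b.file
    WHERE b.cpu_sec > 2*a.cpu_sec;"              # files whose CPU time more than doubled between launches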

  1. The scripts to do this are maintained in the svn directory https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis. Do
     svn co https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/cross_analysis
    For the gxprojN accounts used for offline monitoring, the directory should be ~/monitoring/cross_analysis
  2. The main script is run_cross_analysis.sh, which can be run with
    ./run_cross_analysis.sh [RUNPERIOD] [VERSION] [MINVERSION]
    where, e.g., [RUNPERIOD] = 2015_03, [VERSION] = 22, and [MINVERSION] = 15. However, it is strongly recommended that the commands in this script be run by hand to catch any errors.
  3. Enter the python commands that are in run_cross_analysis.sh. Below are the steps and explanations:
    1. Create a table for the current launch using
      ./create_cross_analysis_table.sh [RUNPERIOD] [VERSION]
      The table will be created from the file template_table_schema.sql and will contain the columns id, run, file, timeChange, cpu_sec, wall_sec, mem_kb, vmem_kb, nevents, input_copy_sec, plugin_sec, final_state, problems.
    2. Run
      python fill_cross_analysis_entries.py [RUNPERIOD] [VERSION]
      The script will gather all of the necessary information either from the SWIF output or from the stdout files for the jobs.
    3. Run
      python create_stats_table_row.py [RUNPERIOD] [VERSION]
      This will loop over the jobs in the current launch and create a row in an HTML table that summarizes the statistics for the final state and problems for the jobs. This table row is then inserted into the main HTML webpage for the run period.
    4. Run
      python create_stats_for_each_file.py [RUNPERIOD] [MINVERSION] [VERSION]
      This creates a comparison table of the final state and problems for each file between launches [MINVERSION] and [VERSION]. In the run_cross_analysis.sh script, the default is to set MINVERSION to be 15 for run period 2015_03, but as long as SWIF was used for all previous launches, any number will work.
    5. Run
      python create_resource_correlation_plots.py [RUNPERIOD] [CMPMINVERSION] [VERSION]
      This creates correlation plots of resource use between launches CMPMINVERSION and VERSION. By default, CMPMINVERSION is 5 launches earlier than VERSION.