Raid-to-Silo Transfer Strategy
Getting data to the tape library
Files are copied to the tape library in the Computer Center (bottom floor of F-wing) via a multi-stage process. The details are as follows:
- The DAQ system will write data to one of the 4 partitions on gluonraid3
- The partition being written to is changed every run by the script $DAQ_HOME/scripts/run_prestart_sync
- It does this by running /gluex/builds/devel/packages/raidUtils/hd_rotate_raid_links.py which updates the links:
/gluex/data/rawdata/prev <- Link to partition where previous run was written
/gluex/data/rawdata/curr <- Link to partition where current (most recent) run is (was) written
/gluex/data/rawdata/next <- Link to partition where next run will be written
- Each partition has 3 directory trees used to maintain the data in various stages as it is copied to tape and kept for potential offline analysis within the gluon cluster
/gluonraid3/dataX/rawdata/active <- directory tree data is written to by DAQ
/gluonraid3/dataX/rawdata/volatile <- directory tree data is moved to for later analysis on gluons
/gluonraid3/dataX/rawdata/staging <- directory tree with files hard linked to volatile for copying to tape
- A series of cron jobs on gluonraid3 in the hdsys account moves and links the data among these directories.
- These cron jobs are based on 4 scripts:
/gluex/builds/devel/packages/raidUtils/hd_stage_to_tape.py
/gluex/builds/devel/packages/raidUtils/hd_link_rundirs.py
/gluex/builds/devel/packages/raidUtils/hd_disk_map_and_free.py
/gluex/builds/devel/packages/raidUtils/hd_copy_sample.py
- For current details, refer to the scripts themselves, which have extensive comments at the top describing what they do. They are located [in subversion]. Here is an overview:
hd_stage_to_tape.py
This script will search for completed runs in the "active" directories of all partitions. For any it finds it will:
- Move the run directory to "volatile"
- Create a tar file of any subdirectories (e.g. RunLog, monitoring, ...)
- Make a hard link to every evio file and tar file in the "staging" directory
- Run the jmigrate program (provided by Scientific Computing Group) to copy files from "staging" to tape
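The first three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual hd_stage_to_tape.py: the function name `stage_run` and its arguments are hypothetical, one tar file per subdirectory is assumed, files are selected by a simple ".evio" name match, and the final jmigrate step is deliberately left out.

```python
import os
import shutil
import tarfile
from pathlib import Path


def stage_run(run_dir: Path, volatile_root: Path, staging_root: Path) -> list:
    """Move one completed run from "active" to "volatile" and hard link it into "staging".

    Hypothetical helper; returns the hard links created in the staging tree.
    """
    # 1. Move the whole run directory out of "active" into "volatile".
    volatile_root.mkdir(parents=True, exist_ok=True)
    volatile_run = volatile_root / run_dir.name
    shutil.move(str(run_dir), str(volatile_run))

    # 2. Bundle each subdirectory (e.g. RunLog, monitoring) into its own tar file
    #    (assumption: one tar per subdirectory).
    for sub in sorted(p for p in volatile_run.iterdir() if p.is_dir()):
        with tarfile.open(str(volatile_run / (sub.name + ".tar")), "w") as tar:
            tar.add(str(sub), arcname=sub.name)

    # 3. Hard link every evio file and tar file into "staging".
    #    Hard links only work when both trees are on the same partition.
    staging_run = staging_root / run_dir.name
    staging_run.mkdir(parents=True, exist_ok=True)
    links = []
    for f in sorted(volatile_run.iterdir()):
        if f.is_file() and (".evio" in f.name or f.suffix == ".tar"):
            link = staging_run / f.name
            os.link(str(f), str(link))
            links.append(link)
    return links
```

Because the staging entries are hard links, removing them after the copy to tape leaves the copies in "volatile" untouched.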
hd_link_rundirs.py
This script will make a symbolic link in /gluex/data/rawdata/all for each run directory. It does this for all partitions, so there is a single location to look in to find all runs currently available across all partitions. It will also remove dead links pointing to run directories that no longer exist.
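A minimal Python sketch of this kind of link maintenance is shown here. The function name `refresh_run_links`, the choice of which directories are scanned, and the "Run*" name pattern are assumptions made for illustration; the real script may differ.

```python
import os
from pathlib import Path


def refresh_run_links(partition_dirs, all_dir: Path) -> None:
    """Keep one symlink per run directory in all_dir and drop dead links."""
    all_dir.mkdir(parents=True, exist_ok=True)

    # Remove links whose target run directory no longer exists.
    # Path.exists() follows the link, so it is False for a dangling symlink.
    for link in list(all_dir.iterdir()):
        if link.is_symlink() and not link.exists():
            link.unlink()

    # Add a link for every run directory that does not have one yet.
    for part in partition_dirs:
        for run_dir in sorted(Path(part).glob("Run*")):
            link = all_dir / run_dir.name
            if run_dir.is_dir() and not os.path.lexists(str(link)):
                link.symlink_to(run_dir)
```

Running it repeatedly is harmless: existing links are left alone and only dangling ones are removed.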
hd_disk_map_and_free.py
This script will identify partitions not currently in use (or in potential danger of being in use soon) and run the map_disk.py utility on those. It will also run the map_disk_autodelete.py utility to ensure adequate disk space is available for another run.
hd_copy_sample.py
This script will copy the first 3 data files from each run into the "volatile" directory on gluonraid2. This will allow a sample of files for all runs in a RunPeriod to be kept on the gluon cluster.
- Files will be written to subdirectories containing both the RUN_PERIOD and run number. For example:
- where RUN_PERIOD is an environment variable set in the /gluex/etc/hdonline.cshrc script and XXXXXX is the 6-digit, zero-padded run number.
- The RunXXXXXX directory will contain:
- all raw data files for the run
- a tar file of the DAQ settings
- a monitoring directory containing the online monitoring histograms generated for the run.
- Stub files referring to the tape-resident files will be placed in the following directory on the JLab CUE:
- On the CUE (and therefore in the tape library), the files are owned by the halldata account.
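As an illustration of the naming convention described above, the following Python sketch builds a run directory name from RUN_PERIOD and a run number. The helper name `run_directory` and the `<base>/<RUN_PERIOD>/RunXXXXXX` layout are assumptions for the example; only the 6-digit zero padding and the use of the RUN_PERIOD environment variable come from the text.

```python
import os
from pathlib import Path


def run_directory(base: str, run_number: int, run_period=None) -> Path:
    """Build <base>/<RUN_PERIOD>/RunXXXXXX with a 6-digit, zero-padded run number."""
    if run_period is None:
        # Normally set by /gluex/etc/hdonline.cshrc
        run_period = os.environ["RUN_PERIOD"]
    return Path(base) / run_period / "Run{:06d}".format(run_number)
```

For example, `run_directory("/raid/rawdata/active", 1234, "RunPeriod-2019-01")` ends in `RunPeriod-2019-01/Run001234`.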
Disk space on RAID
Standard running will use only the gluonraid3 server. This is broken into 4 partitions of 54TB each. However, on Chip's advice, we will routinely use only 80% of each disk, restricting writes to the outer portions of the disks in order to optimize the write/read rates. Further, when a partition is not currently in use, it is subject to preparation for the next run, which means an additional 5.76TB of space will be freed (the 5.76TB is based on 800MB/s for a 2 hour run). Thus, about 37.4TB (0.8 x 54TB - 5.76TB) of space will be in use on each partition once the disk is initially filled.
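The arithmetic behind this estimate can be checked with a short Python sketch. The constants are the figures quoted in the paragraph above, decimal units (1TB = 10^6 MB) are assumed, and the function name is hypothetical. With these inputs the result comes to roughly 37.4TB.

```python
PARTITION_TB = 54.0      # size of one gluonraid3 partition
USABLE_FRACTION = 0.80   # only the outer 80% of the disks is used
RATE_MB_PER_S = 800.0    # DAQ rate used for the estimate
RUN_HOURS = 2.0          # run length used for the estimate


def space_in_use_tb() -> float:
    """TB occupied on a partition after room for the next run has been freed."""
    next_run_tb = RATE_MB_PER_S * RUN_HOURS * 3600 / 1e6  # MB -> TB (decimal units)
    return PARTITION_TB * USABLE_FRACTION - next_run_tb


print("{:.2f} TB".format(space_in_use_tb()))  # prints 37.44 TB
```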
In addition, there are two 72TB RAID disks in the counting house named gluonraid1 and gluonraid2. (They actually report to be 77TB, but only about 72-73TB is usable.) These are primarily used for storing secondary data streams (PXI data from the magnet, sample data from each run, ...). However, the space will be used for additional storage if the need arises such as losing the connection to the tape library.
Monitoring available space
- The currently selected RAID disk should always be linked from /gluex/raid
- The current RAID disk is monitored via the EPICS alarm system. The remaining space is written to an EPICS variable via the program hd_raidmonitor.py. This is always run by the hdops account on gluonraid1 regardless of whether it is the currently selected disk. This actually updates two EPICS variables:
- HD:coda:daq:availableRAID - The remaining disk space on the RAID in TB
- HD:coda:daq:heartbeat - Updated every 2 seconds to switch between 0 and 1 indicating the hd_raidmonitor.py program is still running
- The system is set to raise a warning if the available space drops below a few TB (ask Hovanes for exact number)
- The system is set to raise an error if the available space drops below ??TB or the heartbeat stops updating
- Note that at this time, hd_raidmonitor.py is not automatically started by any system. Therefore, if gluonraid1 is rebooted, the program will need to be restarted by hand.
- To check if the hd_raidmonitor.py program is running, enter the following:
- To start the hd_raidmonitor.py utility do this
> $DAQ_HOME/tools/raidutils/hd_raidmonitor.py >& /dev/null &
- Source for Hall-D specific tools used for managing the RAID are kept in svn here:
- A second cron job is run on gluonraid1 from the hdsys account to remove files from the gluonraid1/rawdata/volatile directory as needed to ensure disk space is available
- (n.b. at the time of this writing, the second cron job has not been set up!)
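The heartbeat check described above (an error is raised if HD:coda:daq:heartbeat stops updating) can be sketched in Python as follows. This is not the EPICS alarm configuration itself. The function `heartbeat_alive` is hypothetical, and `read_heartbeat` stands for any callable that returns the current value of the variable, for instance a wrapper around a Channel Access read such as pyepics' `caget`, if that library is available. The 6 second timeout is an assumed value chosen to be a few times the 2 second toggle period.

```python
import time
from typing import Callable


def heartbeat_alive(read_heartbeat: Callable[[], int],
                    timeout_s: float = 6.0,
                    poll_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if the heartbeat value changes within timeout_s seconds."""
    first = read_heartbeat()
    waited = 0.0
    while waited < timeout_s:
        sleep(poll_s)
        waited += poll_s
        if read_heartbeat() != first:
            return True   # value toggled, so the monitor is still running
    return False          # value never changed, so the monitor looks dead
```

The `sleep` argument exists only so the function can be exercised without real delays.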
Changing Run Periods
We will update to a new Run Period at the beginning of each beam period. This means in between beam times when we take mainly cosmic data, data will go into the previous Run Period's directory. A month or two before a new Run Period, the volume sets should get set up so that cosmic and calibration data taken just prior to the run will go to the right place. To begin a new run period one must do the following:
- Make sure no one is running the DAQ or is at least aware that the run period (and output directory) is about to change. They will probably want to log out and back in after you're done to make sure the environment in all windows is updated.
- Submit a Service Now ticket requesting two new tape volume sets be created.
- Go to JLab Service Now website
- Click on "Create Incident"
- For Category select Scientific Computing
- For Urgency select 2-Medium
- For the issue description enter something like:
We would like to have the following two tape volume sets created please:
where the first should be type "raw" and the second type "production"
n.b. some details on how this has been done before can be found in Service Now request 18907
- You should specify whether each volume is to be flagged "raw" or "production". The "raw" volumes will be automatically duplicated by the Computer Center. The "production" volumes will not. We generally want all Run Periods to be "raw", even for times when only cosmic data is taken, as that is considered a valuable snapshot of detector performance at a specific point in time.
- To check the current list of halld tape volume sets:
- Go to the SciComp Home page
- On the left side menu, in the "Tape Library" section, click on Usage->"Accumulation"
- In the main content part of the window, you should now be able to click the arrow next to "halld" to open a submenu
- In the submenu, click the arrow next to "raw" to see the existing volume sets
- (n.b. The sets listed will be limited to those written to during the dates shown in the calendar in the upper right corner.)
- Modify the file /gluex/etc/hdonline.cshrc to set RUN_PERIOD to the new Run Period name.
- Create the Run Period directory on all raid disks (gluonraid1, gluonraid2, gluonraid3, and gluonraid4) by logging in as hdops and doing the following:
- for gluonraid1 and gluonraid2:
> mkdir /raid/rawdata/active/$RUN_PERIOD
> chmod o+w /raid/rawdata/active/$RUN_PERIOD
- for gluonraid3 and gluonraid4:
> mkdir /data1/rawdata/active/$RUN_PERIOD
> mkdir /data2/rawdata/active/$RUN_PERIOD
> mkdir /data3/rawdata/active/$RUN_PERIOD
> mkdir /data4/rawdata/active/$RUN_PERIOD
> chmod o+w /data?/rawdata/active/$RUN_PERIOD
- Modify the crontab for root on both gluonraid1 and gluonraid2. Do this by editing the following files (as hdsys) to reflect the new Run Period directory:
- when done, the relevant line should look something like this:
*/10 * * * * /root/jmigrate /raid/rawdata/staging/RunPeriod-2019-01 /raid/rawdata/staging /mss/halld
- this will run the jmigrate script every 10 minutes, copying every file found in the /raid/rawdata/staging/RunPeriod-2019-01 directory to the tape library, preserving the directory structure, but replacing the leading /raid/rawdata/staging with /mss/halld. It will also unlink the files and remove any empty directories.
- install it as root on both gluonraid1 and gluonraid2 via:
# crontab /gluex/etc/crontabs/crontab.root.gluonraid
- n.b. you do NOT need to update the root crontab on gluonraid3 or gluonraid4. Those run jmigrate from the hdsys account cronjob via the hd_stage_to_tape.py script.
- Change the run number by rounding up to the next multiple of 10,000.
- Make sure the DAQ system is not running
- Using the hdops account, edit the file $COOL_HOME/hdops/ddb/controlSessions.xml and set the new run number.
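The run number rounding in the last step can be written as a one-line calculation. In this Python sketch the function name is hypothetical, and it is an assumption that a run number already sitting exactly on a multiple of 10,000 still advances to the next multiple.

```python
def next_run_period_start(current_run: int, step: int = 10000) -> int:
    """Round current_run up to the next multiple of step (always moves forward)."""
    return (current_run // step + 1) * step
```

For example, a last run number of 51234 gives 60000 as the first run number of the new Run Period.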
Archival Information
The following is background information on how the system was first discussed and set up. It still may contain some useful info so it is kept here.
e-mail describing RAID->Tape
The portion of the system that copies files from the RAID disk to the tape library was setup by Paul Letta with the help of Chris Larrieu and Kurt Strosahl. Below is an e-mail from Paul sent on June 30, 2014 describing this.
Hi David,

On gluonraid1:

Anything under /raid/rawdata/staging ... will go to tape with the path /mss/halld/halld-scratch/rawdata
(there is lots of garbage under /mss/halld/halld-scratch.....)

And anything that goes under /mss/halld/halld-scratch/ goes to the hall d scratch tapes.

Jmigrate runs every 10 mins. It deletes the files after they have gone to tape. It also deletes empty directories.

So the idea was for the data acq software to write to /raid/rawdata/active. Once a file is complete, create a hard link in /raid/rawdata/staging. Then jmigrate will put the file on tape and delete the hard link. Its up to you to delete files out of /raid/rawdata/active when you are done with them.

The mapping of /raid/rawdata/staging/.... to /mss/halld.... is easy to change.

Test it out... oh.. it runs with the user gluex in jasmine.

Paul

The following is the original documentation written before the system was fully implemented. It still contains some useful information though.
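The prefix mapping between the staging tree and the tape library path can be illustrated with a small Python sketch. This is not jmigrate itself: the function `tape_path` is hypothetical, and its defaults are taken from the crontab example elsewhere on this page (/raid/rawdata/staging replaced by /mss/halld). The mapping in the e-mail above corresponds to passing `tape_root="/mss/halld/halld-scratch/rawdata"`.

```python
from pathlib import PurePosixPath


def tape_path(local_file: str,
              staging_root: str = "/raid/rawdata/staging",
              tape_root: str = "/mss/halld") -> str:
    """Map a file under staging_root to the matching path under tape_root."""
    # relative_to() raises ValueError if the file is not under staging_root.
    relative = PurePosixPath(local_file).relative_to(staging_root)
    return str(PurePosixPath(tape_root) / relative)
```

The directory structure below the staging root is preserved, so only the leading prefix changes.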
Below is a proposal for a raid-to-silo transfer strategy for moving Hall D data files from our local raid server to the JLab tape storage facility. We will update this as our ideas develop.
- Elliott Wolin
- Dave Lawrence
- We will use the jmirror facility from the Computer Center to transfer the files.
- jmirror deletes the link to the file when the transfer is complete. It does not delete directories, only files.
- jmirror is fairly smart and reliable. It only deletes the hard link when the file is safely transferred.
- CRON jobs will delete unneeded dirs after their contents are safely transferred.
- jmirror is run periodically via a CRON job; it is not a transfer server system. It transfers files it finds when it is run.
- jmirror will not transfer files actively being written to, nor transfer files twice if invoked twice.
- You cannot have two instances of jmirror running at the same time because they might clash over files that are partially copied.
- Additional hard links to the data file are untouched by jmirror. These can be used to keep the file on disk after transfer.
- If files are kept they will be deleted "just-in-time" to make room for new DAQ files. This will require cleanup strategy and cron scripts to implement it.
- The DAQ creates a 10 GB file every 30 secs, about 1 TB/hour. Thus a two hour run generates 2 TB.
- Files will be queued up for transfer at the end of the run via run control scripts run under the hdops account. Files will be owned by hdops on the raid system.
- Files in the tape storage facility will be owned by the gluex account.
- Mark I. prefers to store files by "run period" with a simple naming scheme (RunPeriod001, RunPeriod002 or similar).
- Run periods are just date ranges. Run numbers will NOT be reused, i.e. all run numbers are unique across all run periods.
- Due to constraints in the mss a second level of directories is needed. Mark and I propose simply organizing files by run, e.g. something like Run000001, Run000002, etc.
- Run files will have the run number in them, e.g. something like: Run000001.evio.001, Run000001.evio.002, etc.
- A two-hour run will generate around 250 files.
- The RAID system stripes data across all disks, independent of logical partitioning.
- RAID disk partitions do not seem to be needed (see below), they can be implemented later if necessary.
- mv and ln cannot create hard links across partitions, files have to be physically copied to put them on a different partition.
- The raid server must simultaneously read and write at 300 MB/s, it's best to avoid additional file copying.
- We have two completely independent RAID servers, 75 TB each.
- CRON jobs will run under the hdsys or root account as appropriate (not out of the hdops account).
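The rate and file-count figures in the list above can be cross-checked with a few lines of Python. The inputs (10 GB per file, one file every 30 seconds, a two-hour run) come from the list, decimal units are assumed, and the function name is hypothetical. The exact values come out slightly above the rounded numbers quoted in the list.

```python
FILE_GB = 10.0         # size of one DAQ file
FILE_PERIOD_S = 30.0   # one file is written every 30 seconds
RUN_HOURS = 2.0        # length of a typical run


def daq_estimates() -> dict:
    """Derive data rate, hourly volume and per-run totals from the file cadence."""
    files_per_hour = 3600 / FILE_PERIOD_S
    return {
        "rate_MB_per_s": FILE_GB * 1000 / FILE_PERIOD_S,      # about 333 MB/s
        "TB_per_hour": FILE_GB * files_per_hour / 1000,       # 1.2 TB/hour
        "files_per_run": files_per_hour * RUN_HOURS,          # 240 files
        "TB_per_run": FILE_GB * files_per_hour * RUN_HOURS / 1000,  # 2.4 TB
    }
```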
Notes for Dec 2013 Online Data Challenge
- We plan to use a basic automated file transfer mechanism in Dec that deletes files on transfer. If someone has the time we'll try just-in-time deletion.
- Use one RAID server with one DAQ partition. The second RAID server is a hot spare, to be used if the first one dies or if we lose connection to the CC and we need to store data locally.
- Mechanism to switch to the spare RAID server to be determined. To start it can be manual, eventually it will be automated.
- The ER will write data to a run-dependent active directory on the RAID server. The ER runs in the hdops account.
- An end run script will move all the files from that run to a separate run-dependent staging directory on the RAID server.
- The script will also eventually perform run bookkeeping tasks, and create additional hard links if/when we implement just-in-time deletion.
- The jmirror cron job will be run every 5-10 minutes from the root account. It will use a file lock to ensure only one copy runs at a time.
- The jmirror cron job will transfer all files in the staging directory area to the tape storage facility. After transfer hard links in the staging area will be deleted.
- The jmirror cron job will further delete empty directories in the staging area, since being empty means all its files have been transferred.
- Another cron job will periodically check for files in active directories from previous runs that never got moved to the staging area. This can happen if the ER crashes or a run ends badly.
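The file lock mentioned above, used so that only one copy of the transfer job runs at a time, could look something like the following Python sketch. It is an illustration using the standard `fcntl.flock` call on Unix, not the mechanism actually used by the cron job; the function name `try_lock` and the lock file location are hypothetical.

```python
import fcntl
import os


def try_lock(lock_path: str):
    """Return a file descriptor holding an exclusive lock, or None if already locked."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd
```

A cron-driven script would call `try_lock` first and exit quietly when it returns None. The lock is released when the descriptor is closed or the process exits, so a crashed job does not leave a stale lock behind.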
Tasks, Assignments and Schedule
- Install RAID system - Paul - by 30-Oct - done 29-Oct-2013
- Install and test jmirror software and certificates - Paul and Chris L - by 1-Nov - done 29-Oct-2013
- Set up active and staging directories on RAID server - Elliott - by 1-Nov - done 29-Oct-2013
- Test transfer scheme - Paul - by 5-Nov - done 29-Oct-2013
- Write and test jmirror script - Paul and Elliott - by 8-Nov - done 29-Oct-2013
- Implement and test jmirror CRON job - Paul and Elliott - by 8-Nov - done 30-Oct-2013
- Write and test end run script - Elliott and Dave L - by 8-Nov
- Write and test cleanup CRON scripts - Elliott and Paul/Dave L - by 13-Nov
- Full high-speed system test - Elliott and Dave L - before and during Dec 2013 Online Data Challenge
Here is an e-mail exchange Elliott forwarded to me in September 2013 regarding the strategy for writing to the tape library.
-------- Original Message --------
Subject: Re: Tape write speeds
Date: Wed, 4 Sep 2013 14:18:05 -0400 (EDT)
From: Sandy Philpott <email@example.com>
To: Elliott Wolin <firstname.lastname@example.org>, Paul Letta <email@example.com>, "firstname.lastname@example.org" <email@example.com>

One note, when it's installed - rather than using the Counting House staging fileserver mssstg.jlab.org (Hall A/B/C's "mass storage system staging" node sfs61), Hall D will have their own staging fileserver to write raw data to, for copying raw data directly to tape by the hall data writing tool. Then the mssstg node can serve as a backup.

From: "Christopher Larrieu" <firstname.lastname@example.org>
To: "Kurt Strosahl" <email@example.com>
Cc: "Sandy Phillpot" <firstname.lastname@example.org>, email@example.com, "Elliott Wolin" <firstname.lastname@example.org>, "email@example.com Letta" <firstname.lastname@example.org>
Sent: Tuesday, September 3, 2013 9:20:41 AM
Subject: Re: Tape write speeds

Kurt and Elliot,

Please consider the following in benchmarking hall d to-tape data rates:

You will need to produce an appropriate quantity of data over a large enough time span to induce steady-state behavior.

Hall data is treated differently from user data, so you need to use the appropriate process to shuttle data to tape.

Though file size is largely irrelevant, you should nonetheless consider that larger files are more cumbersome to deal with (for example, if some glitch interrupts transfer to tape, the quantity of data that would need to be re-transmitted is potentially much greater). On the other hand, small files would necessarily require greater overhead in book-keeping and management. I would think something like 10G would be a good size to aim for given your data rates.

So my suggestion is the following:

Produce simulated data files (possibly the same one over and over again) of the size and at the rate you anticipate.

Place these files in sfs61:/export/stage/halld/whatever so they can be handled by the hall data writing tool.

Conduct the test without interruption for at least several days.

You will need to create the halld directory and also add a crontab entry to launch jmigrate. You can use entry for hallb as a template. (I can't do this because I can't figure out how to sudo on this machine).

Chris

On Sep 1, 2013, at 11:17 PM, Christopher Larrieu <email@example.com> wrote:

Please have Elliot come talk to me. If he wants to benchmark writing to tape he should do it in a way that does not go through the fairies.

Sent from my iPhone

On Aug 30, 2013, at 15:58, Kurt Strosahl <firstname.lastname@example.org> wrote:

Chris,

That user (Elliot Wolin) who was asking me about optimum write sizes for LTO6 tapes also asked me about some writes hall D was doing... He said that he was getting some very slow writes, and when I dug into it I found the below, can you think of a reason why there was a drop in write speed for some of those files?

scdm7 ~> grep hdops mover.log.2013-08-27
[2013-08-27 14:45:47] [jputter:72512043] INFO JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.007) via user (hdops) proxy subprocess.
[2013-08-27 14:45:47] [jputter:72512042] INFO JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.008) via user (hdops) proxy subprocess.
[2013-08-27 14:49:57] [jputter:72512043] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.007 10,000,671,248 bytes in 51.946 seconds (183.602 MiB/sec)
[2013-08-27 14:49:57] [jputter:72512044] INFO JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.006) via user (hdops) proxy subprocess.
[2013-08-27 14:50:50] [jputter:72512042] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.008 10,000,174,392 bytes in 46.084 seconds (206.946 MiB/sec)
[2013-08-27 14:51:27] [jputter:72512041] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.009 10,000,991,092 bytes in 32.744 seconds (291.280 MiB/sec)
[2013-08-27 14:52:46] [jputter:72512045] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.002 10,001,220,200 bytes in 74.892 seconds (127.355 MiB/sec)
[2013-08-27 14:53:40] [jputter:72512044] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.006 10,000,160,936 bytes in 51.554 seconds (184.988 MiB/sec)
[2013-08-27 14:53:40] [jputter:72512049] INFO JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.000) via user (hdops) proxy subprocess.
[2013-08-27 14:53:40] [jputter:72512050] INFO JputJob.java:233 - Reading network file (/lustre/scicomp/jasmine/fairy/staged/mss/halld/halld-scratch/hdops/et2evio_000000.evio.003) via user (hdops) proxy subprocess.
[2013-08-27 14:56:34] [jputter:72512046] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.005 10,000,174,576 bytes in 169.538 seconds (56.252 MiB/sec)
[2013-08-27 14:59:07] [jputter:72512047] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.001 10,000,190,452 bytes in 149.692 seconds (63.710 MiB/sec)
[2013-08-27 15:00:14] [jputter:72512049] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.000 10,000,188,212 bytes in 52.920 seconds (180.214 MiB/sec)
[2013-08-27 15:01:02] [jputter:72512050] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.003 10,000,246,016 bytes in 43.402 seconds (219.736 MiB/sec)
[2013-08-27 15:01:34] [jputter:72512048] INFO ChannelCopier.java:149 - /mss/halld/halld-scratch/hdops/et2evio_000000.evio.004 10,000,216,984 bytes in 30.424 seconds (313.468 MiB/sec)

Kurt J. Strosahl
System Administrator
Scientific Computing Group, Thomas Jefferson National Accelerator Facility

--
Christopher Larrieu
Computer Scientist
High Performance and Scientific Computing
Thomas Jefferson National Accelerator Facility

--
Sincerely,
Elliott
================================================================================
Those raised in a morally relative or neutral environment will hold no truths to be self-evident.
Elliott Wolin
Staff Physicist, Jefferson Lab
12000 Jefferson Ave Suite 8 MS 12A1
Newport News, VA 23606
757-269-7365
================================================================================
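For anyone wanting to re-examine transfer rates from a mover log like the one quoted above, a short Python sketch follows. It assumes only the "N bytes in S seconds" wording visible in the ChannelCopier lines; the function name is hypothetical and no other log format is implied. The rate is computed in MiB/s (2^20 bytes), which should reproduce the value the log prints in parentheses.

```python
import re

# Matches e.g. "10,000,671,248 bytes in 51.946 seconds"
TRANSFER_RE = re.compile(r"(?P<bytes>[\d,]+) bytes in (?P<secs>[\d.]+) seconds")


def mib_per_sec(log_line: str):
    """Return the transfer rate in MiB/s for a ChannelCopier line, or None."""
    match = TRANSFER_RE.search(log_line)
    if match is None:
        return None  # e.g. a "Reading network file" line
    n_bytes = int(match.group("bytes").replace(",", ""))
    return n_bytes / float(match.group("secs")) / 2**20
```

Applying it to every line of the log and sorting by rate makes the slow transfers (the 56 and 64 MiB/sec entries) easy to pick out.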