Difference between revisions of "Raid-to-Silo Transfer Strategy"

From GlueXWiki

Revision as of 09:20, 25 October 2013

Below is a proposal for a RAID-to-silo transfer strategy for moving Hall D data files from our local RAID server to the JLab tape storage facility. We will update this as our ideas develop.

Elliott Wolin
Dave Lawrence
24-Oct-2013


Notes

  • We will use the jmirror facility from the Computer Center to transfer the files.
  • jmirror deletes the link to the file when the transfer is complete. It does not delete directories, only files.
  • jmirror is fairly smart and reliable. It only deletes the hard link when the file is safely transferred.
  • CRON jobs will delete unneeded dirs after their contents are safely transferred.
  • jmirror is run periodically via a CRON job; it is not a transfer server system. It transfers the files it finds when it is run.
  • jmirror will not transfer files actively being written to, nor transfer files twice if invoked twice.
  • Additional hard links to the data file are untouched by jmirror. These can be used to keep the file on disk after transfer.
  • If files are kept they will be deleted "just-in-time" to make room for new DAQ files. This will require a cleanup strategy and CRON scripts to implement it.
  • The DAQ creates a 10 GB file every 30 secs (about 333 MB/s), or roughly 1.2 TB/hour. Thus a two-hour run generates about 2.4 TB.
  • It is preferable to transfer files as they are ready for transfer, and not wait for the run to end before initiating transfer.
  • The simplest way to implement immediate transfer is for run control to run a script every time the ER closes a file.
  • Vardan and Carl are working out a simple scheme to allow users to specify such a script and have it run when a file is closed.
  • Mark I prefers to store files by "run period" with a simple naming scheme (RunPeriod001, RunPeriod002 or similar).
  • Run periods are just date ranges. Run numbers will NOT be reused, i.e. all run numbers are unique across all run periods.
  • Due to constraints in the mss a second level of directories is needed. Mark and I propose simply organizing files by run, e.g. something like Run000001, Run000002, etc.
  • Run files will have the run number in their names, e.g. something like: Run000001.evio.001, Run000001.evio.002, etc.
  • A two-hour run will generate around 250 files.
  • The RAID system stripes data across all disks, independent of logical partitioning.
  • RAID disk partitions do not seem to be needed (see below). They can be implemented later if necessary.
  • mv and ln cannot create hard links across partitions. Files have to be physically copied to put them on a different partition.
  • The RAID server must simultaneously read and write at 300 MB/s, so it's best to avoid additional file copying.
  • We have two completely independent RAID servers, 75 TB each.
  • All CRON jobs will run under the hdsys account.
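
The hard-link behaviour that the notes rely on can be checked with a short experiment. The following is a minimal Python sketch, not part of the proposal: the directory names "staging" and "keep" are hypothetical, and os.unlink merely stands in for jmirror removing its link after a successful transfer. It shows that a second hard link keeps the data on disk once the staging link is gone.

```python
import os
import tempfile


def demo_hard_link_retention():
    """Show that removing one hard link leaves the data reachable via another."""
    with tempfile.TemporaryDirectory() as top:
        # Hypothetical directory names, both on the same filesystem.
        staging = os.path.join(top, "staging")
        keep = os.path.join(top, "keep")
        os.makedirs(staging)
        os.makedirs(keep)

        staged = os.path.join(staging, "Run000001.evio.001")
        with open(staged, "wb") as f:
            f.write(b"event data")

        # Create a second name for the same inode; no data is copied.
        kept = os.path.join(keep, "Run000001.evio.001")
        os.link(staged, kept)
        links_before = os.stat(kept).st_nlink

        # Assumption: this stands in for jmirror deleting its link after transfer.
        os.unlink(staged)
        links_after = os.stat(kept).st_nlink

        with open(kept, "rb") as f:
            data = f.read()
        return links_before, links_after, data
```

Calling demo_hard_link_retention() returns the link count before and after the staging link is removed, together with the file contents read through the remaining link. If the two directories were on different partitions, os.link would fail with an OSError, which is the constraint noted above for mv and ln.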



Notes for Dec 2013 Online Data Challenge

  • We plan to use a basic automated file transfer mechanism in Dec that deletes files on transfer. If someone has the time we'll try just-in-time deletion.


Proposal

  • Use one RAID server with one DAQ partition. The second RAID server is a hot spare, to be used if the first one dies or if we lose connection to the CC and we need to store data locally.
  • The mechanism to switch to the spare RAID server is to be determined. To start it can be manual; perhaps it can eventually be automated.
  • The ER will write data to a run-dependent active directory on the RAID server. The ER runs in the hdops account.
  • When a file is closed CODA3 will invoke our script to initiate transfer of the file.
  • The script will move the just-closed file to a separate run-dependent staging directory on the RAID server.
  • The script will also eventually perform run bookkeeping tasks, and create additional hard links if/when we implement just-in-time deletion.
  • The jmirror cron job will be run every 5-10 minutes from the hdsys account.
  • The jmirror cron job simply transfers all files in the staging directory area to the tape storage facility. When transferred it deletes the hard links in the staging area.
  • A cron job will delete empty directories in the staging area, since an empty directory means all of its files have been transferred.
  • A cron job will periodically check for files in active directories from previous runs that never got moved to the staging area. This can happen if the ER crashes or a run ends badly.
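
The file-handling steps in the proposal above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual ER or CRON scripts: all function names, the min_age_s threshold, and the use of Run000001-style directory names in both the active and staging areas are hypothetical. It does not invoke jmirror at all.

```python
import os
import time


def run_dir_name(run_number):
    """Directory name for a run, following the Run000001 pattern in the notes."""
    return f"Run{run_number:06d}"


def stage_closed_file(path, staging_root, run_number):
    """Move a just-closed file into the run's staging directory.

    os.rename only changes directory entries when source and destination
    are on the same filesystem, so no data is copied.
    """
    dest_dir = os.path.join(staging_root, run_dir_name(run_number))
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(path))
    os.rename(path, dest)
    return dest


def remove_empty_run_dirs(staging_root):
    """Delete run directories in the staging area that have no entries left."""
    removed = []
    for name in sorted(os.listdir(staging_root)):
        d = os.path.join(staging_root, name)
        if os.path.isdir(d) and not os.listdir(d):
            os.rmdir(d)
            removed.append(d)
    return removed


def find_stranded_files(active_root, current_run, min_age_s=600, now=None):
    """List files left behind in active directories of runs other than current_run.

    min_age_s is a hypothetical safety margin: files modified more recently
    than this are skipped in case they are still being written.
    """
    now = time.time() if now is None else now
    stranded = []
    for name in sorted(os.listdir(active_root)):
        d = os.path.join(active_root, name)
        if not os.path.isdir(d) or name == run_dir_name(current_run):
            continue
        for fname in sorted(os.listdir(d)):
            p = os.path.join(d, fname)
            if os.path.isfile(p) and now - os.path.getmtime(p) >= min_age_s:
                stranded.append(p)
    return stranded
```

In this sketch the script run on file close would call stage_closed_file, while the two cleanup CRON jobs would call remove_empty_run_dirs and find_stranded_files. One design point to settle: a staging directory for the current run can be momentarily empty between files, so the cleanup job may remove it and the next stage_closed_file call will recreate it. A real script would need to tolerate that race.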


Tasks, Assignments and Schedule

  1. Install RAID system - Paul - by 30-Oct
  2. Install and test jmirror software and certificates - Paul and Chris L - by 1-Nov
  3. Set up active and staging directories on RAID server - Elliott - by 1-Nov
  4. Test transfer scheme using production directory strategy - Elliott and Paul - by 5-Nov
  5. Write and test jmirror CRON script - Paul and Elliott - by 8-Nov
  6. Implement user script capability in ER - Vardan and Carl - by 5-Nov
  7. Write and test ER script - Elliott and Dave L - by 8-Nov
  8. Write and test cleanup CRON scripts - Elliott and Paul - by 13-Nov