Difference between revisions of "CMU Data Challenge 2"

(13 intermediate revisions by 2 users not shown)

Latest revision as of 11:55, 14 April 2014

  1. At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are being written to a local RAID disk. Jobs are managed by PBS (TORQUE and Maui).
  2. All 384 cores are reserved for the data challenge for three weeks.
  3. Did not switch to optional version.
  4. Start-up Problems
    • All jobs were initially reading from the same copy of sqlite, resources, and hdds, instead of having their own copies.
    • Large-cluster configuration problems slowed our start. Resolved by tuning PBS parameters to control the rate at which pbs_mom talked to the head node.
    • Still battling a scheduler issue. A workaround has been found.
    • Running smoothly since ~Tuesday.
  5. Final Tally: 7000 jobs, 3 failures:
    • 9001 Series - 5600 1E7 with EM Background (25k Events Each) : 139.87 MEvents : 1 failure (small REST file):
      • 09001_0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient()
    • 9002 Series - 875 5E7 with EM Background (10k Events Each) : 8.75 MEvents : 0 failures
    • 9003 Series - 525 without EM Background (50k Events Each) : 26.15 MEvents : 2 failures:
      • 09003_0000014: lost to the aether (no record of it; likely a PBS failure)
      • 09003_0000392: timed out ~9-10k events into hdgeant (96 hrs)
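
The totals quoted above can be cross-checked with a little arithmetic. The following is only a minimal sketch: the numbers are copied from the list, and the rule that each failed job contributes zero events is an assumption made here, not something the page states. The names `SERIES` and `expected_mevents` are hypothetical and exist only for this illustration.

```python
# Per-series numbers copied from the final tally:
# (series, jobs, events per job, failed jobs, reported MEvents)
SERIES = [
    ("9001", 5600, 25_000, 1, 139.87),
    ("9002", 875, 10_000, 0, 8.75),
    ("9003", 525, 50_000, 2, 26.15),
]


def expected_mevents(jobs, events_per_job, failures):
    """Millions of events, ASSUMING every failed job contributes nothing."""
    return (jobs - failures) * events_per_job / 1e6


# Cluster size and overall job counts quoted in the list.
total_cores = 12 * 32  # 12 boxes, 32 cores per box
total_jobs = sum(jobs for _, jobs, _, _, _ in SERIES)
total_failures = sum(failures for _, _, _, failures, _ in SERIES)

print(f"cores: {total_cores}, jobs: {total_jobs}, failures: {total_failures}")
for name, jobs, per_job, failures, reported in SERIES:
    expected = expected_mevents(jobs, per_job, failures)
    print(f"{name}: expected {expected:.3f} MEvents, reported {reported:.2f}")
```

Under that assumption the core count (384), the job count (7000), the failure count (3), and the 9002 and 9003 event totals all match the figures in the list. The 9001 series works out to 139.975 MEvents against the reported 139.87; the page does not say what accounts for the difference.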