Difference between revisions of "CMU Data Challenge 2"

Latest revision as of 11:55, 14 April 2014

At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are being written to a local RAID disk. Jobs are managed by PBS (torque and maui).
All 384 cores are reserved for the data challenge for three weeks.
Did not switch to optional version.
Start-up Problems
- All jobs were initially reading from the same copy of sqlite, resources, and hdds, instead of having their own copies.
- Large-cluster configuration problems slowed our start. Resolved by tuning PBS parameters to control the rate at which pbs_mom talked to the head node.
- Still battling a scheduler issue. Work-around has been found.
- Running smoothly since ~Tuesday.
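The first start-up problem above (every job reading the same copy of sqlite, resources, and hdds) is usually avoided by staging a private copy into each job's working directory before the job starts. The following is a minimal sketch of that idea, not the procedure actually used at CMU: the function name `stage_private_copies` and the directory layout are hypothetical.

```python
import shutil
from pathlib import Path


def stage_private_copies(shared_paths, job_dir):
    """Copy each shared file or directory into job_dir.

    Hypothetical helper: the source only says that jobs should have had
    their own copies; it does not describe how the copies were made.
    """
    job_dir = Path(job_dir)
    job_dir.mkdir(parents=True, exist_ok=True)
    staged = {}
    for src in map(Path, shared_paths):
        dest = job_dir / src.name
        if src.is_dir():
            # Directory inputs (e.g. a resources tree) are copied recursively.
            shutil.copytree(src, dest, dirs_exist_ok=True)
        else:
            # Single files (e.g. an sqlite database) keep their metadata.
            shutil.copy2(src, dest)
        staged[src.name] = dest
    return staged
```

A job wrapper would call this once with the shared locations and a per-job scratch directory, then point the job at the returned paths instead of the shared originals. It requires Python 3.8 or later because of `dirs_exist_ok`.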
Final Tally: 7000 jobs, 3 failures:
- 9001 Series - 5600 1E7 with EM Background (25k Events Each) : 139.87 MEvents, 1 failure (small REST file):
  - 09001_0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient()
- 9002 Series - 875 5E7 with EM Background (10k Events Each) : 8.75 MEvents : 0 failures
- 9003 Series - 525 without EM Background (50k Events Each) : 26.15 MEvents : 2 failures:
  - 09003_0000014: lost to the aether (no record of it) (likely pbs fail)
  - 09003_0000392: timed-out ~9-10k events into hdgeant (96 hrs)
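The tally can be cross-checked with simple arithmetic: jobs times events per job gives the nominal event count for each series, which can then be compared with the reported MEvents. The sketch below uses only the numbers quoted in the tally; the dictionary layout and function names are illustrative assumptions.

```python
# Job counts, per-job event counts, and reported totals as quoted in the tally.
SERIES = {
    "9001": {"jobs": 5600, "events_per_job": 25_000, "reported_mevents": 139.87},
    "9002": {"jobs": 875, "events_per_job": 10_000, "reported_mevents": 8.75},
    "9003": {"jobs": 525, "events_per_job": 50_000, "reported_mevents": 26.15},
}


def nominal_mevents(jobs, events_per_job):
    """Events expected if every job produced its full quota, in millions."""
    return jobs * events_per_job / 1e6


def summarize(series):
    """Return (name, nominal MEvents, nominal minus reported) per series."""
    rows = []
    for name, s in series.items():
        nominal = nominal_mevents(s["jobs"], s["events_per_job"])
        shortfall = round(nominal - s["reported_mevents"], 2)
        rows.append((name, nominal, shortfall))
    return rows


total_jobs = sum(s["jobs"] for s in SERIES.values())
summary = summarize(SERIES)
```

The job counts sum to the 7000 jobs stated in the tally. The nominal totals are 140.0, 8.75, and 26.25 MEvents, so the reported figures fall short by 0.13, 0, and 0.10 MEvents for the 9001, 9002, and 9003 series respectively.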

@@ Line 1: / Line 1: @@
 # At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are being written to a local RAID disk. Jobs are managed by PBS (torque and maui).
 # All 384 cores are reserved for the data challenge for three weeks.
+# Did not switch to optional version.
 # Start-up Problems
+#* All jobs were initially reading from the same copy of sqlite, resources, and hdds, instead of having their own copies.
 #* Large-cluster configuration problems slowed our start. Resolved by tuning PBS parameters to control the rate at which pbs_mom talked to the head node.
 #* Still battling a scheduler issue. Work-around has been found.
 #* Running smoothly since ~Tuesday.
-# As of 9:00am, 1087 jobs have completed.
+# Final Tally: 7000 jobs, 3 failures:
-#* 9001 Series - 784    1E7 with EM Background
+#* 9001 Series - 5600   1E7 with EM Background (25k Events Each) : 139.87 MEvents, 1 failure (small REST file):
-#* 9002 Series - 225    1E7 without EM Background
+#** 09001_0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient()
-#* 9003 Series -  89    5E7 with EM Background
+#* 9002 Series - 875    5E7 with EM Background (10k Events Each) : 8.75 MEvents : 0 failures
+#* 9003 Series - 525    without EM Background  (50k Events Each) : 26.15 MEvents : 2 failures:
+#** 09003_0000014: lost to the aether (no record of it) (likely pbs fail)
+#** 09003_0000392: timed-out ~9-10k events into hdgeant (96 hrs)
