CMU Data Challenge 2
From GlueXWiki
- At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are written to a local RAID disk. Jobs are managed by PBS (TORQUE and Maui).
- All 384 cores are reserved for the data challenge for three weeks.
- Did not switch to optional version.
- Start-up Problems
  - All jobs were initially reading from the same copy of the sqlite, resources, and hdds files, instead of having their own copies.
  - Large-cluster configuration problems slowed our start; resolved by tuning PBS parameters to control the rate at which pbs_mom talks to the head node.
  - Still battling a scheduler issue; a work-around has been found.
- Running smoothly since ~Tuesday.
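The pbs_mom rate tuning mentioned above is done per node in TORQUE's mom_priv/config file. A hedged sketch of the relevant knobs (illustrative values only; the actual parameters and settings used at CMU are not recorded here):

```
# /var/spool/torque/mom_priv/config -- illustrative, not the actual CMU settings
$status_update_time 120   # seconds between pbs_mom status reports to pbs_server
$check_poll_time    120   # seconds between pbs_mom polls of its running jobs
```

Raising these intervals reduces how often each of the 12 nodes' pbs_mom daemons contacts the head node, which is the kind of rate control described above.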
- Final Tally: 7000 jobs, 3 failures:
  - 9001 Series: 5600 jobs at 1E7 with EM background (25k events each): 139.87 MEvents, 1 failure (small REST file):
    - 09001_0000136: failed in DMagneticFieldMapFineMesh::GetFieldAndGradient()
  - 9002 Series: 875 jobs at 5E7 with EM background (10k events each): 8.75 MEvents, 0 failures
  - 9003 Series: 525 jobs without EM background (50k events each): 26.15 MEvents, 2 failures:
    - 09003_0000014: lost to the aether (no record of it; likely a PBS failure)
    - 09003_0000392: timed out ~9-10k events into hdgeant (96-hour limit)
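The reported totals can be cross-checked from the per-series job counts and events-per-job listed above. A small sanity sketch (it assumes a failed job contributes no events, which reproduces the 9002 and 9003 tallies exactly; the 9001 failure was a truncated REST file, so its real total sits slightly below the full 5600 x 25k = 140M):

```python
# Per-series numbers taken from the tally above.
series = {
    "9001": {"jobs": 5600, "events_per_job": 25_000, "failures": 1},
    "9002": {"jobs": 875,  "events_per_job": 10_000, "failures": 0},
    "9003": {"jobs": 525,  "events_per_job": 50_000, "failures": 2},
}

def completed_events(s):
    # Assumption: a failed job yields no usable events.
    return (s["jobs"] - s["failures"]) * s["events_per_job"]

for name, s in series.items():
    print(name, completed_events(s) / 1e6, "MEvents")
# 9002 gives 8.75 MEvents and 9003 gives 26.15 MEvents, matching the tally;
# the three series together account for all 7000 jobs.
```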