Difference between revisions of "CMU Data Challenge 2"
From GlueXWiki
Revision as of 11:52, 14 April 2014
- At CMU we are using 12 boxes, each with four 8-core AMD Opteron processors (32 cores per box). Each box has 64 GB of physical memory. Data are being written to a local RAID disk. Jobs are managed by PBS (Torque and Maui).
- All 384 cores are reserved for the data challenge for three weeks.
- Did not switch to optional version.
- Start-up Problems
- All jobs were initially reading from the same copy of sqlite, resources, and hdds, instead of having their own copies.
- Large-cluster configuration problems slowed our start. Resolved by tuning PBS parameters to control the rate at which pbs_mom talked to the head node.
- Still battling a scheduler issue, but a workaround has been found.
- Running smoothly since ~Tuesday.
- Final Tally: 7000 jobs:
- 9001 Series - 5600 1E7 with EM Background (25k Events Each) : 139.87 MEvents : 1 failure:
- 09001_0000136: DMagneticFieldMapFineMesh::GetFieldAndGradient()
- 9002 Series - 875 5E7 with EM Background (10k Events Each) : 8.75 MEvents : 0 failures
- 9003 Series - 525 without EM Background (50k Events Each) : 26.15 MEvents : 2 failures:
- 1 job lost to the aether (likely a PBS failure)
- 09003_0000392: timed out ~9-10k events into hdgeant (96 hrs)
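As a quick cross-check of the numbers above, the following minimal Python sketch computes the nominal totals implied by the job counts and per-job event counts in the tally, and compares them with the reported MEvents. The job counts, event counts, and reported figures are copied from the list; the names `SERIES` and `nominal_mevents` are only illustrative and are not part of any data challenge tooling.

```python
# Cross-check of the tally: nominal events = jobs * events per job.
# All numbers are copied from the list above; the names are illustrative only.
SERIES = {
    "9001": {"jobs": 5600, "events_per_job": 25_000, "reported_mevents": 139.87},
    "9002": {"jobs": 875, "events_per_job": 10_000, "reported_mevents": 8.75},
    "9003": {"jobs": 525, "events_per_job": 50_000, "reported_mevents": 26.15},
}


def nominal_mevents(jobs, events_per_job):
    """Events (in millions) expected if every job had finished completely."""
    return jobs * events_per_job / 1e6


total_jobs = sum(s["jobs"] for s in SERIES.values())
total_cores = 12 * 32  # 12 boxes, 32 cores per box

print(f"total jobs: {total_jobs}, total cores: {total_cores}")
for name, s in SERIES.items():
    nominal = nominal_mevents(s["jobs"], s["events_per_job"])
    shortfall = round(nominal - s["reported_mevents"], 2)
    print(f"{name}: nominal {nominal:.2f} MEvents, "
          f"reported {s['reported_mevents']:.2f}, shortfall {shortfall:.2f}")
```

The job counts sum to the 7000 jobs quoted, and the core count matches the 384 reserved cores. For the 9003 series the shortfall of 0.10 MEvents corresponds to two 50k-event jobs, in line with the two failures listed.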