NU DC2 Tests

  • I've been running jobs simulating 10K events using the same package versions as Mark described in the meeting on Friday.
  • The machines have 0.75 - 1.5 GB/core of memory.
  • There are no resource limits
  • I've gotten to a success rate of >50% (the exact number is uncertain since I was staging some of the intermediate files on disks local to the nodes, which would fill up sometimes).
  • Nearly all failures happened at the REST stage, and were usually due to a thread taking too long and being killed. I've increased the thread timeout to 90s, and this seems to help.
  • The REST processes do get up to 1.5-2 GB in size
  • The failed jobs do seem consistent with either hitting some events that take very long to reconstruct or being resource starved. I'm going to see what I can find out about the events on which the jobs die.
  • I'm also running jobs simulating 50K events to more closely reproduce Mark's results.

--Sdobbs 00:34, 10 February 2014 (EST)

  • Ran jobs overnight with 50k events each.
  • Killed ~1/3 after 13+ hours.
  • Success rate >40%, but lots of failures at mcsmear level this time
  • Jobs on new nodes were running fine, but older ones were clearly resource-constrained (likely memory)
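Killing long-running jobs by hand, as in this entry, could instead be done with a wall-clock limit in the job wrapper. The following is a minimal sketch, assuming the job step is launched from a Python wrapper; `run_with_limit` and the stand-in command are hypothetical and not part of the actual job scripts.

```python
import subprocess
import sys
from typing import Sequence


def run_with_limit(cmd: Sequence[str], limit_s: float) -> str:
    """Run cmd and report 'ok', 'failed', or 'timeout'.

    subprocess.run kills the child process once limit_s seconds
    have passed and then raises TimeoutExpired.
    """
    try:
        done = subprocess.run(cmd, timeout=limit_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if done.returncode == 0 else "failed"


# Stand-in command (assumption): a real wrapper would launch the
# simulation or reconstruction step here instead.
print(run_with_limit([sys.executable, "-c", "pass"], limit_s=60.0))
```

Only the direct child is killed on timeout, so a step that spawns its own subprocesses would need extra cleanup.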

--Sdobbs 14:36, 10 February 2014 (EST)

  • Ran jobs with 10K events each.
  • Success rate of 161/250 = 64%.
  • 18 jobs hung at beginning of REST generation, still in queue
  • ~10 jobs seemed to succeed but had truncated REST files, all from one node. There was also a problem downloading the magnetic field on that node - seems to be a problem with disk space on the node?
  • most of the others died at the beginning of REST file generation without any useful info.
  • a few other misc failures
  • going to run the same with gdb to try and get some more useful info
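The bookkeeping in this entry (161 of 250 jobs succeeding, 18 hung) is simple arithmetic. A minimal sketch of the tally follows; the helper `summarize` and the category names are illustrative assumptions, not part of any existing tool, and only the counts stated exactly above are used.

```python
from typing import Dict


def summarize(counts: Dict[str, int], submitted: int) -> Dict[str, float]:
    """Return each category as a percentage of submitted jobs.

    Jobs not covered by any category are reported as 'unaccounted'.
    """
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    accounted = sum(counts.values())
    if accounted > submitted:
        raise ValueError("more jobs counted than submitted")
    report = {name: round(100 * n / submitted, 1) for name, n in counts.items()}
    report["unaccounted"] = round(100 * (submitted - accounted) / submitted, 1)
    return report


# Counts taken from the entry above; the remaining jobs are lumped together.
for name, pct in summarize({"succeeded": 161, "hung": 18}, submitted=250).items():
    print(f"{name}: {pct}%")
```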

--Sdobbs 15:15, 11 February 2014 (EST)

  • Two nodes hung in the previous run (the dreaded automounter hangs). Restarting 250 jobs w/ 10K events.

--Sdobbs 13:01, 12 February 2014 (EST)

  • 17 jobs crashed due to problems downloading the magnetic field files - curl download loses connection to squid proxy =( Will try running using CVMFS when configuration changes actually propagate?
  • No other crashes noted, ~3 REST files didn't copy over well for some reason (NFS problems?)
  • Could be some other throttling due to dumping MC tree for each event or running under gdb.
  • Am killing ~30 jobs which are taking > 3 hours to complete - this seems to happen due to EM background?
  • Need to investigate squid not caching correctly.
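One way to make jobs more tolerant of a dropped proxy connection is to retry the download with an increasing delay. This is a minimal sketch and not what the jobs currently do: `fetch_with_retry` is a hypothetical helper, and the `fetch` callable stands in for whatever actually performs the download.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or attempts are used up.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    Only OSError is retried; it covers ConnectionError and
    urllib.error.URLError. The last failure is re-raised.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays.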

--Sdobbs 16:51, 12 February 2014 (EST)

  • Running 250 jobs of 1k each.
  • Caching resources (magnetic field) on disk, not just CCDB
  • Most REST jobs crashed, was running through gdb, most crashes seem to be at REST output functions.
  • Ran jobs again staging just REST output on local disk (some nodes don't have enough local disk to cache everything), and had success rate > 95% (though I did stop the jobs early).
  • Am running 50k jobs overnight.
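Since some nodes lack the local disk to cache everything, a job could check free space before deciding where to stage its output. A minimal sketch, assuming a Python job wrapper; `choose_staging_dir` and the directory arguments are hypothetical names.

```python
import shutil
from pathlib import Path


def choose_staging_dir(local_dir: Path, network_dir: Path, needed_bytes: int) -> Path:
    """Prefer local_dir when it has at least needed_bytes free.

    Falls back to network_dir if the local disk is too full or the
    local path cannot be inspected at all.
    """
    try:
        free = shutil.disk_usage(local_dir).free
    except OSError:
        return network_dir
    return local_dir if free >= needed_bytes else network_dir
```

The check is only advisory: other jobs on the same node can fill the disk after it passes, so output writes still need error handling.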

--Sdobbs 19:03, 12 February 2014 (EST)

  • 50k jobs stopped due to a stupid mistake.
  • restarted 1000 50k event jobs.
  • Have test cvmfs distro working, now just need to fix squid caching and will be all set to test it
  • Found interesting failure mode: when running a cluster full of 1k jobs with REST hd_ana jobs writing to network disk and running under gdb, essentially all jobs crash. A majority of the jobs crashed when accessing geometry files. When I started caching the HDDS files locally, the jobs all crashed while building/accessing the magnetic field. Not sure if this is due to an actual bug or gdb just being weird. Anyway, am writing REST files to local disk now, so let's not worry about that for now.

--Sdobbs 15:37, 13 February 2014 (EST)

  • Running jobs under cvmfs had a very high crash rate due to problems loading the software. I am wondering how well squid is actually caching. Maybe it is getting overloaded - will try to collect some data next week after CM.
  • Current round of 50k jobs (using Mark's new version, loading over NFS) are running with ~70-80% success rate.
  • Large number of jobs (~10%?) will die in REST production at the outset, and then hang around and not die. This is sad and ungood.
  • Why the large number of unsuccessful jobs, when we haven't seen these with other experiments? I think the problem is file transfer errors, due either to underpowered servers or to some weird software mismatch at the network level. I plan to make some configuration changes, but first I should finish the Ubuntu -> CentOS 6 move (just realized that most cluster running has been with compute nodes running Ubuntu; I wonder if TCP congestion control, or some other algorithm, is a problem). In any case, grid usage will not use NFS to transfer most files, so hopefully this won't be a problem. Will also try running one hd_ana process per node and see if that helps.
  • Why do my jobs not seem to use large amounts of memory (<2 GB, hdgeant ~700 MB) as seen by Paul & Kei? Need to collect data.

--Sdobbs 11:50, 18 February 2014 (EST)

  • Note that 14 jobs out of 500 (~3%) did not produce mcsmear results. It looks like the hdgeant jobs hung, likely the dreaded automounter failures. Going to run in single-slot-per-node mode to see how that affects the hd_ana failure rate.

--Sdobbs 14:52, 18 February 2014 (EST)

  • Things are working better since I rebuilt the master node and fixed some NFS configuration problems.
  • Was having some trouble with newer nodes (12+12HT cores), with jobs crashing on the first event. Disabled hyperthreading and things are running much better now.
  • Ran 500 10k event jobs with EM bkgd turned on, branches/sim-recon-dc-2 as of Tuesday 2/25, and the following card:
INFILE 'bggen.hddm'
OUTFILE 'hdgeant.hddm'
TRIG 10000
SCAP    0.       0.      65.
CUTS 1e-4 1e-4 1e-3 1e-3 1e-4
GELH  1     0.2     1.0     4     0.160

BEAM 12. 9.
BGGATE -200. 200.

  • 37 jobs still running, many stuck in hdgeant. Is this due to old nodes or EM background?
  • 486 REST files made
    • 432 successful!
    • 54 smaller than expected
  • 14 jobs with missing output?
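Sorting output files into full, undersized, and missing, as in the counts above, can be scripted. A minimal sketch follows; `classify_outputs` is a hypothetical helper, and the size threshold is an assumption that would need to be set from the size of a known-good REST file.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def classify_outputs(paths: Iterable[Path], min_bytes: int) -> Dict[str, List[Path]]:
    """Sort expected output files into 'full', 'small', and 'missing'.

    A file counts as 'small' when it exists but is below min_bytes,
    which is used here as a rough proxy for a truncated file.
    """
    result: Dict[str, List[Path]] = {"full": [], "small": [], "missing": []}
    for path in paths:
        if not path.is_file():
            result["missing"].append(path)
        elif path.stat().st_size < min_bytes:
            result["small"].append(path)
        else:
            result["full"].append(path)
    return result
```

File size alone cannot confirm that a file is complete; it only flags the obviously truncated ones.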

--Sdobbs 17:18, 27 February 2014 (EST)

  • 250 jobs run
  • 25 stalled
  • 197 finished with full file
  • 6 small file