Difference between revisions of "Notes on the Data Challenge"

From GlueXWiki
(Created page with "=Ideas= * Curtis's note ** Comprehensive ** See note * Mark's bullet points ** Develop system for submitting and tracking large-volume simulation and reconstruction jobs * Matt'...")
 
(Ideas)
Line 11:
  * David
  ** two data challenges: simulation and reconstruction
- *** simulation: run the MC, reconstruct, it produce, reconstructed data
+ *** simulation: run the MC, reconstruct it, produce reconstructed data
  *** reconstruction: create fake raw data sample, reconstruct it
  ** shipping reconstructed data to two institutions

Revision as of 13:08, 11 July 2012

Ideas

  • Curtis's note
    • Comprehensive
    • See note
  • Mark's bullet points
    • Develop system for submitting and tracking large-volume simulation and reconstruction jobs
  • Matt's email
    • Include data analysis system
    • Develop/test system for delivering reconstruction data to data analyzers
  • David
    • two data challenges: simulation and reconstruction
      • simulation: run the MC, reconstruct it, produce reconstructed data
      • reconstruction: create fake raw data sample, reconstruct it
    • shipping reconstructed data to two institutions
    • other specific proposals

Tools

  • EventStore
  • PanDA
    • received offer of help from Torre Wenaus

Intermediate Goal: Mini Data Challenges (reconstruction-type)

  • one major problem to be solved is how to scale:
    • how to generate and run thousands of jobs
    • assess their status (before, during, and after they run)
    • manage all output files and diagnostic data
    • same issues for simulation and reconstruction: want a common framework
  • with this in place we can iterate in mini-data challenges
    • wrong data mix?: change it
    • wrong output format?: change it
    • wrong photon reconstruction algorithm?: change it
  • we want to be in a position where re-running a mini-challenge is not a big deal
  • in parallel develop everything else
    • code correctness
    • execution speed
    • design and implement analysis system
    • raw data generation
    • planning for test bed for full data challenge
    • reconstructed data format
  • find bottlenecks at intermediate scale
  • target schedule: first mini-challenge by, say, September 1, then one every two weeks after that
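
The scaling items above (generate thousands of jobs, assess their status, manage output files, one framework for simulation and reconstruction) could be prototyped along the following lines. This is only a sketch under stated assumptions: JobTracker, Job, Status, and the status names are hypothetical, nothing here talks to a real batch system, and it is not the interface of EventStore or PanDA.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical job states; a real batch system will have its own."""
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    job_id: int
    kind: str  # "simulation" or "reconstruction": same record for both
    status: Status = Status.PENDING
    outputs: list = field(default_factory=list)  # output/diagnostic file names


class JobTracker:
    """In-memory bookkeeping only; submission to a farm is out of scope."""

    def __init__(self):
        self.jobs = {}

    def submit(self, kind, count):
        """Register `count` new jobs of one kind and return their ids."""
        start = len(self.jobs)
        ids = list(range(start, start + count))
        for job_id in ids:
            self.jobs[job_id] = Job(job_id, kind)
        return ids

    def update(self, job_id, status, outputs=()):
        """Record a status change and any files the job produced."""
        job = self.jobs[job_id]
        job.status = status
        job.outputs.extend(outputs)

    def summary(self):
        """Count jobs per status, usable before, during, and after a run."""
        return Counter(job.status for job in self.jobs.values())

    def failed(self):
        """Ids of failed jobs, i.e. the candidates for resubmission."""
        return [j.job_id for j in self.jobs.values() if j.status is Status.FAILED]
```

With something like this in place, a mini-challenge amounts to a call such as tracker.submit("reconstruction", 1000), periodic calls to tracker.summary(), and a resubmission loop over tracker.failed(). Re-running after changing the data mix or output format would reuse the same bookkeeping.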

Analysis System

  • another major problem: we don't have one
  • more of a design and development effort
  • can be fed by data from mini-challenges
    • event format
    • storage requirements/configuration
    • data discovery
    • user tools
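
As a starting point for the data discovery item, a file catalog fed by mini-challenge output might look like the sketch below. Everything in it is an assumption made for illustration: the names FileRecord and Catalog, the fields (path, run, data_format, size_bytes), and the in-memory list standing in for whatever database is eventually chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """One reconstructed-data file; the fields are hypothetical."""
    path: str
    run: int
    data_format: str
    size_bytes: int


class Catalog:
    """Toy in-memory catalog: register files, then query by metadata."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def find(self, run=None, data_format=None):
        """Return records matching every criterion that is not None."""
        return [
            r for r in self._records
            if (run is None or r.run == run)
            and (data_format is None or r.data_format == data_format)
        ]

    def total_size(self, **criteria):
        """Summed size in bytes of matching files, for storage estimates."""
        return sum(r.size_bytes for r in self.find(**criteria))
```

Each mini-challenge could register its output files here, which would give early numbers for the storage requirements and a concrete query interface for the user tools to build on.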