Difference between revisions of "HDDM Programmer's Interface"

From GlueXWiki

Revision as of 10:04, 1 July 2016

Introduction

HDDM was introduced in the context of GlueX as a means to encode output from Monte Carlo simulations and results from their reconstruction. To understand why we needed something like HDDM, rather than going with a community-based standard such as HDF, see [1] below. That reference also contains a description of the design principles and requirements for the software package. The purpose of this wiki page is to give a quick-start guide for programmers who want to write new hddm files or read data from existing files. The package comes with a set of tools and programmer interfaces that make this very easy to do, particularly with python. The underlying implementation is in C++, so it provides good performance in terms of data rate to/from disk files with serial access. On-the-fly compression/decompression and automatic integrity verification are built into the package. Random access to events at any location in a file without reading the entire file is also supported.

Templates and schemas

HDDM files are built from an xml template. A hddm template is a short xml document that describes the structure of one record in the hddm stream. Every hddm file has a copy of its template at the beginning, followed by its event data in a compact binary format. The template is what arranges those data into a meaningful structure. A simple example of a template is given below.

<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm">
  <student name="string" minOccurs="0">
    <enrolled year="int" semester="int" maxOccurs="unbounded">
      <course credits="int" title="string" maxOccurs="unbounded">
        <result grade="string" Pass="boolean" />
      </course>
    </enrolled>
  </student>
</HDDM>

All of the events in the file represent repeats of this basic structure, with different values in the data fields. All actual data values are represented as attributes of tags. Attributes that are assigned type names ("string", "int", "long", "float", "double", "boolean", "anyURI", and "Particle_t") are user data. Any other values are treated as literal strings, and do not take up space in the file (other than in the template header). Some of these literal attributes function as metadata, e.g. you might want to add an attribute unit="GeV" to document the units used for other attributes in a tag. Others like minOccurs/maxOccurs tell the data model whether a given element is always present in every event or may be omitted (minOccurs="0" indicates this) or whether it may be repeated any number of times (maxOccurs="unbounded" indicates this). The top-level element is special in that it must always be named HDDM and have the attributes shown above. The class attribute is an abbreviation that you choose for the data model you are creating. Choose a short, unique name for your class because it is used in filenames that are written by the hddm tools, and the abbreviation prevents collisions between files built from different templates (classes).
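The split between user data and literal attributes can be checked mechanically. The following is a minimal sketch in python 3, using only the standard library and the example template above. The helper name classify_attributes is made up for this illustration and is not part of the hddm toolkit.

```python
import xml.etree.ElementTree as ET

# The example template from above, as bytes so the encoding declaration is honored.
TEMPLATE = b"""<?xml version="1.0" encoding="UTF-8"?>
<HDDM class="x" version="1.0" xmlns="http://www.gluex.org/hddm">
  <student name="string" minOccurs="0">
    <enrolled year="int" semester="int" maxOccurs="unbounded">
      <course credits="int" title="string" maxOccurs="unbounded">
        <result grade="string" Pass="boolean" />
      </course>
    </enrolled>
  </student>
</HDDM>
"""

# Type names that mark an attribute as user data, as listed in the text.
TYPE_NAMES = {"string", "int", "long", "float", "double",
              "boolean", "anyURI", "Particle_t"}

def classify_attributes(template):
    """Hypothetical helper: map tag name -> (user data, literal) attribute dicts."""
    table = {}
    for elem in ET.fromstring(template).iter():
        tag = elem.tag.split("}")[-1]   # drop the "{namespace}" prefix
        data = {k: v for k, v in elem.attrib.items() if v in TYPE_NAMES}
        literal = {k: v for k, v in elem.attrib.items() if v not in TYPE_NAMES}
        table[tag] = (data, literal)
    return table

attrs = classify_attributes(TEMPLATE)
for tag, (data, literal) in attrs.items():
    print(tag, "data:", sorted(data), "literal:", sorted(literal))
```

In this sketch only the attributes in the first dictionary would occupy space in the event data; class, version, minOccurs and maxOccurs fall into the literal group.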

Templates provide an intuitive informal way of specifying the structure of a record in a hddm file. For most users, this is all they need to know about, but for those familiar with XML there is a more formal way to specify the structure of an xml document, which is called an xml schema. HDDM uses schemas in two different ways. The first is to specify the structure of the templates themselves; the above template conforms to a schema called "http://www.gluex.org/hddm" (hint: this is not a URL to anywhere, it is a URI known as an 'xml namespace'), as indicated in the xmlns attribute in the HDDM tag of the template. The schema for this document type is found in hddm_schema.xsl in the main hddm directory of the distribution. The second use of schemas is that every hddm file is itself a valid xml document, so it needs a schema against which it can be verified. The hddm toolkit provides a pair of tools, hddm-schema and schema-hddm, that convert back and forth between templates and schemas. The two are equivalent ways of representing the same information about the structure of a hddm record, with the schema being more complete and standards-based, while the template is much shorter and more intuitive to most users. Schemas can express a much more general set of constraints on the data and the relationships between them, but experience has shown that their practical use for this purpose is very limited, except for specialists. For the remainder of this document, we will deal only with templates.

How to get started

The hddm toolkit is distributed as a part of the GlueX sim-recon distribution, which is available from the github repository JeffersonLab/sim-recon. Instructions for how to download and build sim-recon are given elsewhere on this wiki. The hddm tools are located in sim-recon/src/programs/Utilities/hddm. Checking out the repository, setting up your build environment, and executing "scons -u install" from that directory should be all that is needed to build the hddm toolkit. Before continuing to read this document, make sure that the basic tools like hddm-xml, xml-hddm, hddm-c, hddm-cpp, hddm-py, and xml-xml are in your shell PATH. These tools are not the hddm libraries themselves, but the tools you need to build the libraries from a template.
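A quick way to confirm that the tools are visible is to look each one up in the PATH. The sketch below assumes python 3.3 or later (for shutil.which). The function name missing_tools is hypothetical, and the tool names are simply those listed above.

```python
import shutil

# Tool names as listed in the text above.
HDDM_TOOLS = ["hddm-xml", "xml-hddm", "hddm-c", "hddm-cpp", "hddm-py", "xml-xml"]

def missing_tools(tools=HDDM_TOOLS):
    """Return the subset of tools that cannot be found in the shell PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("not found in PATH:", ", ".join(missing))
else:
    print("all hddm tools found")
```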

Before you can begin to work with hddm files, you need a template. There is a template at the head of every hddm file, so if you have a hddm file that has already been created that you want to work with, simply extract the header using a text editor and save it to a file with the extension ".xml". Another way to get started would be to copy/paste the above example template into a file "exam2.xml" (or copy it from the distribution directory). The instructions that follow assume that you have done this. Now it is time to build a hddm i/o library to let you read and write hddm data records. Currently there are 3 programming languages supported by hddm: python, C++, and c. Python is the least verbose and most readable interface, so let's begin with that.
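For those who prefer not to use a text editor, the header can also be cut out with a few lines of python. This is only a sketch: it assumes that the template sits at the very start of the file as plain text and ends at the first closing </HDDM> tag, and the function name extract_template is made up for this illustration.

```python
def extract_template(raw):
    """Return the xml header of an hddm byte stream, up to and including </HDDM>.

    Assumption: the header is plain text ending at the first closing </HDDM> tag.
    """
    end_tag = b"</HDDM>"
    pos = raw.find(end_tag)
    if pos < 0:
        raise ValueError("no closing </HDDM> tag found")
    return raw[:pos + len(end_tag)].decode("utf-8") + "\n"

# Stand-in for a real file: a tiny header followed by fake binary event data.
fake_file = b'<HDDM class="x" version="1.0">\n</HDDM>\n\x00\x01\x02\xff'
template = extract_template(fake_file)
print(template)
```

With a real file, the raw bytes would come from something like open("exam.hddm", "rb").read(), and the returned text would be saved to a file with the extension ".xml".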

Independent of any user programs or language-specific API, the hddm toolkit provides two tools that can be used to read and write hddm files directly from the command line. The following command accepts any valid hddm file as input and prints the contents of the file in plain-text xml to standard output.

$ hddm-xml [-n <count>] [-o <output xml file>] <input hddm file>

The reverse action is provided by the xml-hddm tool, assuming that the user has a copy of the template and the input data file already formatted as a valid xml file.

$ xml-hddm -t <template> [-o <output hddm file>] <input xml file>

Since the full xml rendition of a data file with many records is extremely verbose, this tool is of limited use in actual practice, except to process the output from the hddm-xml tool and run the decoding procedures in reverse. This can be useful in cases where one might doubt the fidelity of the encoding being used by hddm. These two tools do not require any compile-and-link step each time the template is changed, so they are very useful to keep track of what a hddm file actually contains. Keep them handy when working through the language-specific procedures below.
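The fidelity check described above amounts to three commands run in sequence. The python sketch below only assembles the command lines from the options in the two synopses. The intermediate file names (pass1.xml, pass2.hddm, pass2.xml) and the function name round_trip_commands are hypothetical.

```python
import shlex

def round_trip_commands(hddm_file, template, workdir="."):
    """Build argv lists for hddm -> xml -> hddm -> xml using the options shown above.

    The intermediate file names are hypothetical, chosen for this sketch.
    """
    xml1 = workdir + "/pass1.xml"
    hddm2 = workdir + "/pass2.hddm"
    xml2 = workdir + "/pass2.xml"
    return [
        ["hddm-xml", "-o", xml1, hddm_file],               # decode original file
        ["xml-hddm", "-t", template, "-o", hddm2, xml1],   # encode it again
        ["hddm-xml", "-o", xml2, hddm2],                   # decode the re-encoded copy
    ]

commands = round_trip_commands("exam2.hddm", "exam2.xml")
for argv in commands:
    print("$", " ".join(shlex.quote(arg) for arg in argv))
```

Each list can be handed to subprocess.run(argv, check=True). If the encoding is faithful, one would expect the two xml files to carry the same content, which can then be compared with any diff tool.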

HDDM in python

If you have access to a hddm file that was written by someone else, copy it into your work directory and use a text editor to extract the header into a file, which you may call "exam.xml". Use the following commands to build the python module that you will need to read the contents of this file.

$ hddm-cpp exam.xml
$ hddm-py exam.xml
$ python setup_hddm_X.py build -b build_hddm_X

In this example, I assumed that the HDDM "class" letter (see the HDDM tag in your template header) was "X". You should change it to whatever the actual class abbreviation is for your hddm file. The above steps should create a shared library that starts with hddm_X in your work directory. Copy that module to a place in your PYTHONPATH where you usually place your private python modules, then execute the following program to print the contents of your hddm file in plain text. I assumed it was called "exam.hddm".

import hddm_X
for rec in hddm_X.istream("exam.hddm"):
   print(rec)

To see the same data printed out as a properly formatted xml document, replace "print(rec)" with "print(rec.toXML())".

writing hddm files in python

For this example, I return to the template listed at the top of this page, which I call "exam2.xml". Using the build steps above, build the python module hddm_x, then try the following code to write a new output hddm file from scratch, starting only from the template.

import hddm_x
ofs = hddm_x.ostream("exam2.hddm")
xrec = hddm_x.HDDM()
student = xrec.addStudents()
student[0].name = "Humphrey Gaston"
enrolled = student[0].addEnrolleds()
enrolled[0].year = 2005
enrolled[0].semester = 2
course = enrolled[0].addCourses(3)
course[0].credits = 3
course[0].title = "Beginning Russian"
result = course[0].addResults()
result[0].grade = "A-"
result[0].Pass = True
course[1].credits = 1
course[1].title = "Bohemian Poetry"
result = course[1].addResults()
result[0].grade = "C"
result[0].Pass = 1
course[2].credits = 4
course[2].title = "Developmental Psychology"
result = course[2].addResults()
result[0].grade = "B+"
result[0].Pass = True
ofs.write(xrec)

Copy this python program to a file and execute it using the python interpreter. This generates a new hddm file called exam2.hddm. Now running the above 3-line python print program on exam2.hddm should yield the following output.

HDDM
  student name="Humphrey Gaston"
    enrolled semester=2 year=2005
      course credits=3 title="Beginning Russian"
        result Pass=true grade="A-"
      course credits=1 title="Bohemian Poetry"
        result Pass=true grade="C"
      course credits=4 title="Developmental Psychology"
        result Pass=true grade="B+"

The structure of the output record you are going to write is already known to the program because it knows about your template. All that you need to do is to fill in the elements and assign the values of the attributes. You begin by creating an empty record by calling the HDDM() default constructor. Then you populate the structure top-down by calling addXXXs() methods for each tag XXX under that. The name XXXs is the name of the tag element in the template in a capitalized-plural form. The addXXXs() methods take a single optional int argument, which is the number of copies of that element that need to be added (default is 1). They return a list that can be indexed in the usual python fashion to give access to the individual members of the list. Each of these has addXXXs() methods for each of its contents, and so on down the tree. You can omit whole branches of the tree by simply not calling the corresponding addXXXs() method, although xml rules require that you specify minOccurs="0" for the containing tag in the template if you plan to do that. As soon as a new element list is created, you can fill in the values of its attributes using simple assignment semantics, as illustrated in the example. The names of the python data members are the same as the names of the attributes in the template.
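The naming rule can be written down in a few lines. The sketch below assumes that the plural is formed by simply appending "s", which is what every example on this page shows. The helper names are made up for this illustration and are not part of the generated module.

```python
def capitalize_first(name):
    """Upper-case only the first letter, leaving the rest unchanged."""
    return name[:1].upper() + name[1:]

def add_method(tag):
    """Name of the method that adds instances of a tag, e.g. course -> addCourses."""
    return "add" + capitalize_first(tag) + "s"

def get_method(tag):
    """Name of the matching getXXXs() method used when reading records."""
    return "get" + capitalize_first(tag) + "s"

# Tag names taken from the example template.
for tag in ["student", "enrolled", "course", "result"]:
    print(tag, "->", add_method(tag), get_method(tag))
```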

reading hddm files in python

For this illustration, I assume you have created the file exam2.hddm using the instructions in the previous section. The following python program lets you open this file and extract bits of information from the first record, writing a summary report at the end. Of course, in actual practice a hddm file would contain many records and the analysis would loop over many instances of student.

import hddm_x
ifs = hddm_x.istream("exam2.hddm")
xrec = ifs.read()
total_enrolled = 0
total_courses = 0
total_credits = 0
total_passed = 0
for course in xrec.getCourses():
   total_courses += 1
   if course.getResult().Pass:
      if course.year > 1992:
         total_credits += course.credits
      total_passed += 1
   total_enrolled += 1
print("{0} enrolled in {1} courses and passed {2} of them,\n"
      "earning a total of {3} credits."
      .format(course.name, total_courses, total_passed, total_credits))

Running the above code should produce output like the following:

Humphrey Gaston enrolled in 3 courses and passed 3 of them,
earning a total of 8 credits.

The istream object itself functions as an iterable in python, so the construct "for rec in hddm_x.istream("exam2.hddm"):" loops over all records in the input file, with the rec loop variable being the HDDM element from each record. Likewise, each call to a getXXXs() method returns a python list of tag element objects that is iterable using "for" semantics, as illustrated for xrec.getCourses() above. As before, the individual attributes of each tag instance are accessed as plain data members of their host object. The standard list functions (e.g. len(list), str(list), repr(list)) all work as expected for the tag list objects returned by getXXXs() methods. This interface was designed to be pythonic, i.e. "there should be only one (obvious) way to do it", so most things should work intuitively. It is especially powerful when combined with pyroot to allow a quick-and-simple prototyping framework for physics analysis.

advanced features of the python API

See section on Advanced features below.

HDDM in C++

If you have access to a hddm file that was written by someone else, copy it into your work directory and use a text editor to extract the header into a file, which you may call "exam.xml". Use the following commands to build the C++ module that you will need to read the contents of this file.

$ hddm-cpp exam.xml
$ mv hddm_x.cpp hddm_x++.cpp
$ g++ -c  hddm_x++.cpp XString.cpp XParsers.cpp md5.c -I $HALLD_HOME/$BMS_OSNAME/include \
-I$XERCESCROOT/include -L $XERCESCROOT/lib -l xerces-c -L $HALLD_HOME/$BMS_OSNAME/lib \
-lxstream -lz -lbz2

The rename step from hddm_x.cpp to hddm_x++.cpp is inserted to prevent any confusion between the files created in this section and those created below for use with the c API, but is not essential if this is the only interface that is of interest to you.

writing hddm files in C++

For this example, I return to the template listed at the top of this page, which I call "exam2.xml". Use the build steps above to compile the hddm_x C++ API classes into the object file hddm_x++.o. Create a new file and cut/paste the contents of the box below into it, then save it.

#include <fstream>
#include "hddm_x.hpp"
int main()
{
   // build the nodal structure for this record and fill in its values
   hddm_x::HDDM xrec;
   hddm_x::StudentList student = xrec.addStudents();
   student().setName("Humphrey Gaston");
   hddm_x::EnrolledList enrolled = student().addEnrolleds();
   enrolled().setYear(2005);
   enrolled().setSemester(2);
   hddm_x::CourseList course = enrolled().addCourses(3);
   course(0).setCredits(3);
   course(0).setTitle("Beginning Russian");
   course(0).addResults();
   course(0).getResult().setGrade("A-");
   course(0).getResult().setPass(true);
   course(1).setCredits(1);
   course(1).setTitle("Bohemian Poetry");
   course(1).addResults();
   course(1).getResult().setGrade("C");
   course(1).getResult().setPass(1);
   course(2).setCredits(4);
   course(2).setTitle("Developmental Psychology");
   course(2).addResults();
   course(2).getResult().setGrade("B+");
   course(2).getResult().setPass(true);

   std::ofstream ofs("exam2.hddm");
   hddm_x::ostream ostr(ofs);
   ostr << xrec;
   xrec.clear();
   return 0;
}

Copy this C++ program to a file called write_exam.cpp and compile it into an executable using a command like the following.

$ g++ -o write_exam write_exam.cpp hddm_x++.o -I. -I $HALLD_HOME/$BMS_OSNAME/include \
-I$XERCESCROOT/include -L $XERCESCROOT/lib -l xerces-c -L $HALLD_HOME/$BMS_OSNAME/lib \
-lxstream -lz -lbz2

This may need to be customized for your own build environment. Once it completes successfully, you will find the executable write_exam in the working directory. Run it as "./write_exam" and it should create a new hddm file called exam2.hddm. Running "hddm-xml exam2.hddm" should produce output like the following.

HDDM
  student name="Humphrey Gaston"
    enrolled semester=2 year=2005
      course credits=3 title="Beginning Russian"
        result Pass=true grade="A-"
      course credits=1 title="Bohemian Poetry"
        result Pass=true grade="C"
      course credits=4 title="Developmental Psychology"
        result Pass=true grade="B+"

The structure of the output record you are going to write is already known to the program because it knows about your template. All that you need to do is to fill in the elements and assign the values of the attributes. You begin by creating an empty record by calling the HDDM() default constructor. Then you populate the structure top-down by calling addXXXs() methods for each tag XXX under that. The name XXXs is the name of the tag element in the template in a capitalized-plural form. The addXXXs() methods take a single optional int argument, which is the number of copies of that element that need to be added (default is 1). They return a subclass of std::list that can be iterated over in the usual fashion, or indexed with operator()(int) to access the individual members of the list. Each of these has addXXXs() methods for each of its contents, and so on down the tree. You can omit whole branches of the tree by simply not calling the corresponding addXXXs() method, although xml rules require that you specify minOccurs="0" for the containing tag in the template if you plan to do that. As soon as a new element list is created, you can fill in the values of its attributes using set<attname> methods, as illustrated in the example, where <attname> is a capitalized version of the names of the attribute in the template.

reading hddm files in C++

For this illustration, I assume you have created the file exam2.hddm using the instructions in the previous section. The following C++ program lets you open this file and extract bits of information from the first record, writing a summary report at the end. Of course, in actual practice a hddm file would contain many records and the analysis would loop over many instances of student.

#include <fstream>
#include <iostream>
#include "hddm_x.hpp"
int main()
{
   std::ifstream ifs("exam2.hddm");
   hddm_x::HDDM xrec;
   hddm_x::istream istr(ifs);
   istr >> xrec;
   hddm_x::CourseList course = xrec.getCourses();
   int total_courses =course.size();
   int total_enrolled = 0;
   int total_credits = 0;
   int total_passed = 0;
   hddm_x::CourseList::iterator iter;
   for (iter = course.begin(); iter != course.end(); ++iter) {
      if (iter->getResult().getPass()) {
         if (iter->getYear() > 1992) {
            total_credits += iter->getCredits();
         }
         ++total_passed;
      }
   }
   std::cout << course().getName() << " enrolled in "
             << total_courses << " courses "
             << "and passed " << total_passed << " of them, " << std::endl
             << "earning a total of " << total_credits
             << " credits." << std::endl;
   return 0;
}

Running the above code should produce output like the following:

Humphrey Gaston enrolled in 3 courses and passed 3 of them,
earning a total of 8 credits.

Each use of the extraction operator, "istr >> xrec", reads the next record from the input file into xrec, so an analysis over a whole file repeats that statement in a loop over the records. Likewise, each call to a getXXXs() method returns a C++ list of tag element objects that can be traversed with iterators, as illustrated for xrec.getCourses() above. The individual attributes of each tag instance are accessed through get<attname>() methods of their host object. Because these tag list objects derive from std::list, the standard member functions (e.g. size(), begin(), end()) work as expected. The C++ interface follows the same structure as the python one, so the two reading examples can be compared nearly line by line.

advanced features of the C++ API

See section on Advanced features below.

HDDM in c

Advanced features

References