RCDB conditions python
Contents
- 1 Introduction
- 2 Installation
- 3 ALL YOU HAVE TO KNOW examples
- 4 Connection
- 5 Data model
- 6 Creating condition types
- 7 Adding data to database
- 8 Replacing previous values
- 9 SQLAlchemy
- 10 Editing or deleting objects
- 11 Database querying
- 12 Logging
- 13 Performance
- 14 Command line tools
- 15 Nightly generated SQLite files
- 16 Support
Introduction
Run conditions is the way to store information related to a run (which is identified by run_number everywhere). From a simplistic point of view, run conditions are presented in RCDB as name-value pairs attached to a run number. For example, event_count = 1663 for run 100.
More versatile options of conditions include:
- A condition can also hold a time information of occurrence name - value (+time)
- Several values could be attached by the same name to the same run. So it looks like name - [(value1, time1), (value2, time2), ... ]
- As opposite, API can ensure that there in strictly one value per run
- Different types of values are supported
This tutorial covers RCDB conditions python API, which provides complete tooling for conditions management.
The API is developed using SQLAlchemy ORM, which unifies workflow for MySQL and SQLite databases
(and many more, actually). RCDB API hides many complexities of SQLAlchemy and provides simple and very
straightforward functions to manage conditions. But users can use all power of SQLAlchemy for querying and
filtering results if they wish.
Lets see how python code would look for the example above. Read event_count for run 100:
import rcdb # Open SQLite database connection db = rcdb.RCDBProvider("sqlite:///path.to.file.db") # Read value for run 100 event_count = db.get_condition(100, "event_count").value
Write event_count=1663 for run 100:
# Once in a lifetime, create a condition type, that defines event_count ct = db.create_condition_type("event_count", ConditionType.INT_FIELD, False) # Write condition value to run 100 db.add_condition(100, "event_count", 1663)
There is a small handy command line tool rcnd, that allows to see RCDB conditions and write values
export RCDB_CONNECTION=mysql://rcdb@localhost/rcdb rcnd --help # Gives you self descriptive help rcnd 1000 event_count # See exact value of 'event_count' for run 1000 rcnd --write 1663 100 event_count # Write condition value to run 100
What RCDB conditions are not designed for? - They are not designed for large data sets that change rarely (value is the same for many runs).
That is because each condition value is independently saved (and attached) for each run.
In the case of bulk data, it is better to save it using other RCDB options. RCDB provides the files saving mechanism as example.
Installation
1. Get rcdb.
RCDB svn is:
https://halldsvn.jlab.org/repos/trunk/online/daq/rcdb/rcdb
2. Set environment.
There are *environment.bash* or *environment.csh* scripts, which automatically set environment variables for the of rcdb
source environment.bash
The script:
- sets $RCDB_HOME - to RCDB root directory,
- appends $PYTHONPATH with $RCDB_HOME/python
- appends $PATH with rcdb bin folder
3.Choose database
The main database is considered to be MySQL in counting house. The connection string is:
mysql://rcdb:<hdops_pwd>@gluondb1/rcdb
SQLite database snapshot is also available at:
/u/group/halld/Software/rcdb
To experiment with RCDB and examples below, there is create_empty_sqlite.py script in $RCDB_HOME/python folder.
The script creates empty sqlite database. The usage is:
python $RCDB_HOME/python/create_empty_sqlite.py path_to_database.db
ALL YOU HAVE TO KNOW examples
Python
At least to start with RCDB conditions, to put values and to get them back:
from datetime import datetime from rcdb.provider import RCDBProvider from rcdb.model import ConditionType # 1. Create RCDBProvider object that connects to DB and provide most of the functions db = RCDBProvider("sqlite:///example.db") # 2. Create condition type. It is done only once db.create_condition_type("my_val", ConditionType.INT_FIELD, is_many_per_run=False, description="This is my value") # 3. Add data to database db.add_condition(1, "my_val", 1000) # Replace previous value db.add_condition(1, "my_val", 2000, replace=True) # 4. Get condition from database condition = db.get_condition(1, "my_val") print condition print "value =", condition.value print "name =", condition.name
The script result:
<Condition id='1', run_number='1', value=2000> value = 2000 name = my_val
More actions on objects:
# 5. Get all existing conditions names and their descriptions for ct in db.get_condition_types(): print ct.name, ':', ct.description
The script result:
my_val : This is my value
# 6. Get all values for the run 1 run = db.get_run(1) print "Conditions for run {}".format(run.number) for condition in run.conditions: print condition.name, '=', condition.value
The script result:
my_val = 2000
The example also available as:
$RCDB_HOME/python/example_conditions_basic.py
It is assumed that 'example.db' is SQLite database, created by *create_empty_sqlite.py* script. To run it:
python $RCDB_HOME/python/create_empty_sqlite.py example.db python $RCDB_HOME/python/example_conditions_basic.py
(!) note that to run the script again you probably have to delete the database rm example.db
The next sections will cover this example and give thorough explanation on what is here.
Command line tools
Command line tools provide less possibilities for data manipulation than python API at the moment.
export RCDB_CONNECTION=mysql://rcdb@localhost/rcdb rcnd --help # Gives you self descriptive help rcnd -c mysql://rcdb@localhost/rcdb # -c flag sets connection string from command line instead of environment rcnd # Gives database statistics, number of runs and conditions rcnd 1000 # See all recorded values for run 1000 rcnd 1000 event_count # See exact value of 'event_count' for run 1000 # Creating condition type (need to be done once) rcnd --create my_value --type string --description "This is my value" # Write value for run 1000 for condition 'my_value' rcnd --write "value to write" --replace 1000 my_value # See all condition names and types in DB rcnd --list
More information and examples are in #Command line tools section below.
Connection
db = RCDBProvider("sqlite:///example.db")
RCDBProvider is an object that holds database session and provides connect/disconnect functions. It uses connection strings to pass database parameters to the class. It also also carry functions to manage run condition and other RCDB data.
The functions usually return database model objects (described right in the next section).
Additional manipulations over this objects could be done with SQLAlchemy (described later).
For now we consider to use MySQL and SQLite databases. The connection strings for them are:
MySQL
mysql://user_name:password@host:port/database
SQLite
sqlite:///path_to_file
(!) Note that because SQLite doesn't have user_name and password, it starts with three slashes ///. And thus there are four slashes //// in absolute path to file.
sqlite:////home/user/example.db
More about connections could be found in
[SQLAlchemy documentation]
In the example above class constructor is used to connect to database. But there are more connection functions:
# Create provider without connecting db = RCDBProvider() # Connect to database db.connect("sqlite:///example.db") # check connection and get connection string from provider if db.is_connected: print "connected to:", db.connection_string #disconnect from DB db.disconnect()
(!) Note that connect function doesn't really connect to database. It just creates so called engine and session objects using the connection string. Thus, connect function raises exceptions if the connection string has wrong format or there is no required libraries in the system. But if there is no physical connection to MySQL or there is no such SQLite file, the function doesn't raise eny errors. The errors are raised on first data retrieval in such case.
Data model
Database structure
At the database level conditions part presented as 3 tables:
RUNS CONDITIONS CONDITION_TYPES number <-- run_num name type_id --> field_type *_value is_many_per_run time
So when we talk about name-value pair for the run, this actually means that:
- Run number and other run information (like times of start and end) is stored in the runs table.
- Names and type of value are stored in the condition_types table.
- And, finally, values are stored in the conditions table, each record of it is referenced to a run and to a condition_type.
Python class structure
Python API data model classes resembles this structure. There are 3 python classes that you work with:
- Run - represents run
- Condition - stores data for the run
- ConditionType - stores condition name, field type and other
All classes have properties to reference each other. The main properties for conditions management are:
class Run(ModelBase): number # int - The run number start_time # datetime - Run start time end_time # datetime - Run end time conditions # list[Condition] - Conditions associated with the run class ConditionType(ModelBase): name # str(max 255) - A name of condition value_type # str(max 255) - Type name. One of XXX_FIELD (see below) is_many_per_run # bool- True if the value is allowed many times per run values # query[Condition] - query to look condition values for runs # Constants, used for declaration of value_type STRING_FIELD = "string" INT_FIELD = "int" BOOL_FIELD = "bool" FLOAT_FIELD = "float" JSON_FIELD = "json" BLOB_FIELD = "blob" TIME_FIELD = "time" class Condition(ModelBase): time # datetime - time related to condition (when it occurred in example) run_number # int - the run number @property value # int, float, bool or string - depending on type. The condition value text_value # holds data if type STRING_FIELD,JSON_FIELD or BLOB_FIELD int_value # holds data if type INT_FIELD float_value # holds data if type FLOAT_FIELD bool_value # holds data if type BOOL_FIELD run # Run - Run object associated with the run_number type # ConditionType - link to associated condition type name # str - link to type.name. See ConditionType.name value_type # str - link to type.value_type. See ConditionType.value_type
How data is stored in the DB
As you may noticed from comments above, in reality data is stored in one of the fields:
Storage field | Value type |
---|---|
text_value | STRING_FIELD, JSON_FIELD or BLOB_FIELD |
int_value | INT_FIELD |
float_value | FLOAT_FIELD |
bool_value | BOOL_FIELD |
When you call Condition.value property, Condition class checks for type.value_type and returns an appropriate xxx_value.
Why is it so? - because we would like to have queries like: "give me runs where event_count > 100 000"
i.e., if we know that event_count is int, we would like database to operate it as int.
At the same time we would like to store strings and more general data with blobs. To have it, RCDB uses so called "hybrid approach to object-attribute-value model". If value is int, float, bool or time, it is stored in appropriate field, which allows to use its type when querying. Finally it is possible search over ints, floats and time and, at the same time, to store more complex objects as JSON or blobs... to figure out them lately
Creating condition types
To save data in run conditions, a "condition type" should be created first. It is done once in a database lifetime. Lets look create_condition_type from the example above (we add parameter names here):
db.create_condition_type(name="my_val", value_type=ConditionType.INT_FIELD, is_many_per_run=False, description="This is my value")
name - The first parameter is condition name. When we say "event_count for run 100", "event_count" is that name.
Names are case sensitive. The API doesn't validate names for any name convension and there is no built in checking for
spaces. But spaces would definitely make problems so are not recommended.
It is possible to have names like:
category/sub/name category-sub-name category-sub_name
Names are just strings. RCDB doesn't provide special treatment of slashes '/' or directories.
value_type - The second parameter defines type of the value. It can be one of:
- ConditionType.STRING_FIELD
- ConditionType.INT_FIELD
- ConditionType.BOOL_FIELD
- ConditionType.FLOAT_FIELD
- ConditionType.TIME_FIELD
- ConditionType.JSON_FIELD
- ConditionType.BLOB_FIELD
More examples of how to use types are presented in the next section
is_many_per_run - Allows to store many values with different time for the same run
- False - API works as name - value(time), i.e. it checks that there is only one value per run
- True - API allows name - [(value1, time1), (value2, time2), ...] scheme.
Explanation - There are two different behaviours that are assumed for run conditions: Sometimes it is intended to
have strictly one name-value for a run. "total_events" or "target_material" are the examples. If
is_many_per_run=False, then API checks that there is only one value per run. But the sometimes it is
desirable to track value change during a run. Hall "temperature" or "current" are those examples.
If is_many_per_run=True, then API allows to set several values for different times under the same name for the same run
More examples on it is given in #Replacing previous values
description - 255 chars max human readable description, that other users can see. It is optional but it is very
good practice to fill it.
Adding data to database
Basic types: int, float, bool, string
To store basic types one of the fields should be used:
- ConditionType.STRING_FIELD
- ConditionType.INT_FIELD
- ConditionType.BOOL_FIELD
- ConditionType.FLOAT_FIELD
Lets example it:
# Create RCDBProvider provider object and connect it to DB db = RCDBProvider("sqlite:///example.db") # Crete condition types db.create_condition_type("int_val", ConditionType.INT_FIELD, False) db.create_condition_type("float_val", ConditionType.FLOAT_FIELD, False) db.create_condition_type("bool_val", ConditionType.BOOL_FIELD, False) db.create_condition_type("string_val", ConditionType.STRING_FIELD, False) # Add values to run 1 db.add_condition(1, "int_val", 1000) db.add_condition(1, "float_val", 2.5) db.add_condition(1, "bool_val", True) db.add_condition(1, "string_val", "test test") # Read values for run 1 and use them condition = db.get_condition(1, "int_val") print condition.value condition = db.get_condition(1, "float_val") print condition.value condition = db.get_condition(1, "bool_val") print condition.value condition = db.get_condition(1, "string_val") print condition.value
The output:
1000 2.5 True test test
Time information
A time information can be attached to any condition value. Standard python datetime is used for that: (Lets see the first example):
# Create condition type db.create_condition_type("my_val", ConditionType.INT_FIELD, False) # Add value and time information db.add_condition(1, "my_val", 2000, datetime(2015, 10, 10, 15, 28, 12, 111111)) # Get condition from database condition = db.get_condition(1, "my_val") print condition print "value =", condition.value print "name =", condition.name print "time =", condition.time
The output is:
<Condition id='1', run_number='1', value=2000> value = 2000 name = my_val time = 2015-10-10 15:28:12.111111
If time is the only relevant information for a condition, then ConditionType.TIME_FIELD type can be used to create
the condition type. In this case Condition.value field will have time information and time can be passed as
value parameter of add_condition function:
db.create_condition_type("lunch_bell_rang", ConditionType.TIME_FIELD, False) # add value to run 1 time = datetime(2015, 9, 1, 14, 21, 01) db.add_condition(1, "lunch_bell_rang", time) # get from DB val = self.db.get_condition(1, "lunch_bell_rang") print val.value print val.time
The output is:
2015-09-01 14:21:01 2015-09-01 14:21:01
Note that val.value and val.time are the same in this example.
Multiple values per run
To add many values of the same type, is_many_per_run parameter of create_condition_type function should be set to True. Then you are able to add many condition values per one run, but specifying time for each of them.
(!) if is_many_per_run=True, then get_condition returns a list of Condition objects. <inc>Even</inc>
if there is only one object selected.
Example
# Many condition values allowed for the run (is_many_per_run=True) # 1. If run has this condition, with the same value and actual_time the func. DOES NOTHING # 2. If run has this conditions but at different time, it adds this condition to DB db.create_condition_type("multi", ConditionType.INT_FIELD, True) time1 = datetime(2015,9,1,14,21,01, 222) time2 = datetime(2015,9,1,14,21,01, 333) # First addition to DB. Time is None db.add_condition(1, "multi", 2222) # Ok. Value for time1 is added to DB db.add_condition(1, "multi", 3333, time1) db.add_condition(1, "multi", 4444, time2) results = db.get_condition(1, "multi") # We should get 3 values as: # 0: value=2222; time=None # 1: value=3333; time=time1 # 2: value=4444; time=time2 # lets check it print results values = [result.value for result in results] times = [result.time for result in results] print values print times
The output:
[<Condition id='1', run_number='1', value=2222>, <Condition id='2', run_number='1', value=3333>, <Condition id='3', run_number='1', value=4444>] [2222, 3333, 4444] [None, datetime(2015, 9, 1, 14, 21, 1, 222), datetime(2015, 9, 1, 14, 21, 1, 333)]
Arrays and dictionaries
Multiple values per run are NOT intended to store arrays of data.
Best way to store arrays and dictionaries is serializing them to JSON. Use ConditionType.JSON_FIELD for that.
RCDB conditions API doesn't provide mechanisms of converting objects to JSON and from JSON.
For arrays it is done easily by json module.
The example from [python 2.7 documentation]:
>>> import json >>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}]) '["foo", {"bar": ["baz", null, 1.0, 2]}]' >>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]') [u'foo', {u'bar': [u'baz', None, 1.0, 2]}]
So, serialization is on your side. It is done to have a better control over serialization. This means that if condition type is JSON_FIELD, add_condition function awaits string and after you get condition back, Condition.value contains string.
Example:
import json from rcdb.provider import RCDBProvider from rcdb.model import ConditionType # Create RCDBProvider provider object and connect it to DB db = RCDBProvider("sqlite:///example.db") # Create condition type db.create_condition_type("list_data", ConditionType.JSON_FIELD, False) db.create_condition_type("dict_data", ConditionType.JSON_FIELD, False) list_to_store = [1, 2, 3] dict_to_store = {"x": 1, "y": 2, "z": 3} # Dump values to JSON and save it to DB to run 1 db.add_condition(1, "list_data", json.dumps(list_to_store)) db.add_condition(1, "dict_data", json.dumps(dict_to_store)) # Get condition from database restored_list = json.loads(db.get_condition(1, "list_data").value) restored_dict = json.loads(db.get_condition(1, "dict_data").value) print restored_list print restored_dict print restored_dict["x"] print restored_dict["y"] print restored_dict["z"]
The output is:
[1, 2, 3] {u'y': 2, u'x': 1, u'z': 3} 1 2 3
The example is located at
$RCDB_HOME/python/example_conditions_store_array.py
and can be run as:
python $RCDB_HOME/python/create_empty_sqlite.py example.db python $RCDB_HOME/python/example_conditions_store_array.py
As one can mention unicode string is returned as unicode after json deserialization (look at u"x" instead of just "x"). It is not a problem if you just work with this array, because python acts seamlessly with unicode strings. As you can see in example, we use usual string "x" in restored_dict["x"] and it just works.
If it is a problem, there is a [stackoverlow question on that]
Using pyYAML to deserialize to strings looks easy.
Custom python objects
To save custom python objects to database, jsonpickle package could be used. It is an open source project available via pip install. It is not shipped with RCDB at the moment.
from rcdb.provider import RCDBProvider from rcdb.model import ConditionType import jsonpickle class Cat(object): def __init__(self, name): self.name = name self.mice_eaten = 1230 # Create RCDBProvider provider object and connect it to DB db = RCDBProvider("sqlite:///example.db") # Create condition type db.create_condition_type("cat", ConditionType.JSON_FIELD, False) # Create a cat and store in in the DB for run 1 cat = Cat('Alice') db.add_condition(1, "cat", jsonpickle.encode(cat)) # Get condition from database for run 1 condition = db.get_condition(1, "cat") loaded_cat = jsonpickle.decode(condition.value) print "How cat is stored in DB:" print condition.value print "Deserialized cat:" print "name:", loaded_cat.name print "mice_eaten:", loaded_cat.mice_eaten
The result:
How cat is stored in DB: {"py/object": "__main__.Cat", "name": "Alice", "mice_eaten": 1230} Deserialized cat: name: Alice mice_eaten: 1230
jsonpickle installation:
system level:
pip install jsonpickle
user level:
pip install --user jsonpickle
STRING_FIELD vs. JSON_FIELD vs. BLOB_FIELD
What if data doesn't fit into the string or JSON? There is ConditionType.BLOB_FIELD type.
Concise instruction is much like JSON:
- Set condition type as BLOB_FIELD
- You serialize object whatever you like
- Save it to DB as string
- Load from DB
- Deserialize whatever you like
But what is the difference between STRING_FIELD, JSON_FIELD and BLOB_FIELD?
There is no difference in terms of storing the data. A Condition class, same as a database table, has text_value
field where text/string data is stored. The ONLY difference is how this fields are treated and presented in GUI.
- STRING_FIELD - is considered to be a human readable string.
- JSON_FIELD - is considered to be JSON, which is colored and formatted accordingly
- BLOB_FIELD - is considered to be neither very readable string nor JSON. But it is still should converted to some string. And I hope it will never be used.
Replacing previous values
What if the condition value for this run with this name already exists in the DB?
In general, to replace value replace=True parameter should be set in add_condition.
For single value per run: 1. If run has this condition, with the same value and time, exception is not raised and function does nothing. 2. If value OR actual_time is different than in DB, function checks 'replace' flag and behave accordingly to it
Example:
db.add_condition(1, "event_count", 1000) # First addition to DB db.add_condition(1, "event_count", 1000) # Ok. Do nothing, such value already exists db.add_condition(1, "event_count", 2222) # Error. OverrideConditionValueError db.add_condition(1, "event_count", 2222, replace=True) # Ok. Replacing existing value print(db.get_condition(1, "event_count")) # value: 2222 # time: None time1 = datetime(2015,9,1,14,21,01, 222) time2 = datetime(2015,9,1,14,21,01, 333) db.add_condition(1, "timed", 1, time1) # First addition to DB db.add_condition(1, "timed", 1, time1) # Ok. Do nothing db.add_condition(1, "timed", 1, time2) # Error. Time is different db.add_condition(1, "timed", 5, time1) # Error. Value is different db.add_condition(1, "timed", 5, time2, True) # Ok. Value replaced print(db.get_condition(1, "timed")) # value: 5 # time: time2
If many condition values allowed for the run (is_many_per_run=True)
- If run has this condition, with the same value and same time the func. DOES NOTHING
- If run has this conditions but at different time, it adds this condition to DB
- If run has this condition at this time
Example:
time1 = datetime(2015,9,1,14,21,01, 222) time2 = datetime(2015,9,1,14,21,01, 333) db.add_condition(1, "event_count", 1000) # First addition to DB. Time is None db.add_condition(1, "event_count", 1000) # Ok. Do nothing, such value already exists db.add_condition(1, "event_count", 2222) # Error. Another value for time None db.add_condition(1, "event_count", 2222, replace=True) # Ok. Replacing existing value for time None db.add_condition(1, "event_count", 3333, time1) # Ok. Value for time1 is added to DB db.add_condition(1, "event_count", 4444, time1) # Error. Value differs for time1 db.add_condition(1, "event_count", 4444, time2) # Ok. Add 444 for time2 to DB print(db.get_condition(1, "event_count")) # [0: value=2222; time=None # 1: value=3333; time=time1 # 2: value=4444; time=time2]
SQLAlchemy
SQLAlchemy makes link between python classes and related database tables. It loads data from DB to classes and when objects are changed, can commit changes back to DB. Also SQLAlchemy glues the classes and makes it possible to navigate between objects.
Lets see a code example:
# open database db = rcdb.RCDBProvider("sqlite:///example.db") # get Run object for the run number 1 run = db.get_run(1) # now we have access to all conditions for that run as run.conditions # get all condition names or all condition values names = [condition.name for condition in run.conditions] values = [condition.values for condition in run.conditions]
SQLAlchemy makes queries to database if needed. So when you do run = self.db.get_run(1)
, Run.conditions
collection is not yet loaded from DB. It actually isn't loaded even when we do like x=run.conditions. But first time
when a real value is needed, database is queried for all conditions for that run.
Editing or deleting objects
Even if overriding of existing values are possible for RCDB, deleting data or editing existing condition types considered to be avoided. But sometimes it is needed. Especially at the development/debugging phase.
To edit or delete things SQLAlchemy session object can be used.
Editing
Edit condition type
# get condition type condition_type = db.get_condition_type("my_var") # Change what you need condition_type.value_type = ConditionType.JSON_FIELD # Calling session commit will save changes to database db.session.commit()
Rename condition
# get condition type condition_type = db.get_condition_type("my_var") # Change what you need condition_type.name = "new_var" # Calling session commit will save changes to database db.session.commit()
The magic is that all data for all runs are now accessible by new_var
Deleting
Deleting objects is done with session.delete function:
# Edit condition type condition_type = db.get_condition_type("my_var") # mark the object for deletion db.session.delete(condition_type) # Calling session commit will save changes to database db.session.commit()
More about session and SQLAlchemy objects manipulation with it can be found in [SQLAlchemy documentation]
Database querying
Working with runs
If you ever want to get Run object by run_number here is how:
run = db.get_run(run_number) print run.number print run.start_time print run.end_time print run.conditions... # but it is written further
How to query runs is shown far below
Get runs by number (or intruduction to SQLAlchemy queries)
Lets select all runs with run_number < 100 using SQLAlchemy
# open database db = rcdb.RCDBProvider("sqlite:///example.db") # create query query = db.session.query(Run).filter(Run.number < 100) # get count of selected runs print query.count() # get first run from selected print query.first() # get all run that matches the creteria print query.all()
What happened?
db.session - gets SQLAlchemy session object
.query(Run) - here we say, that we want Run objects to be returned. At the same time we say what table we want to query
.filter(Run.number < 100) - filtering clause
When we've got query ready, we can actually get objects by query.first()
or query.all()
(there are actually more) or just count number of runs by query.count()
We can use Run.conditions to get conditions for each run. Lets see more advanced example
# open database db = rcdb.RCDBProvider("sqlite:///example.db") # create query query = db.session.query(Run) .filter(Run.number.between(50,55) .order_by(desc(Run.number)) # get all such runs runs = query.all() for run in runs: event_count, = (condition.value for condition in run.conditions if condition.name=='event_count')
It works and looks easy. But there is one drawback, each selected run will call one SELECT QUERY to DB to get its conditions. If might be OK for many cases.
Raw SQLAlchemy queries
What if we want to select runs by conditions value?
First, lets say, that if RCDBProvider gives access to SQLAlchemy session, then it is possible to make use of full
power of SQLAlchemy queries.
Lets say, we want to get all runs with event_count > 100 000
# open database db = rcdb.RCDBProvider("sqlite:///example.db") # create query query = db.session.query(Run).join(Run.conditions).join(Condition.type)\ .filter(ConditionType.name == "event_count")\ .filter(Condition.int_value > 100 000)\ .order_by(Run.number) # get count of selected runs print query.count() # get first run from selected print query.first() # get all run that matches the creteria print query.all()
What happened here.
By first line:
query = db.session.query(Run).join(Run.conditions).join(Condition.type)\
we say, that we would like to select Run objects (.query(Run)), and also that we will use conditions and condition types (.join(Run.conditions).join(Condition.type)).
Then we filter results (.filter(...)) and ask results to by ordered by Run.number (.order_by(Run.number))
All these functions (join, filter, order_by, ...) returns Query object, that allows to stack them as many as needed.
Finally, to get the results, one of query.count(), query.first(), query.one() or query.all() is called.
But probably you already feel drawbacks of this approach:
- First, you see that you have to use int_value to filter conditions. That by many means worse than using Condition.value property, that handles type automatically.
- Another drawback is that when you add more logic, the query becomes bulky.
Lets imagine next example. We look for run in range 1000 to 2000 with event_count > 10000, some data_value in range 1.2 and 2.4
query = db.session.query(Run).join(Run.conditions).join(Condition.type)\ .filter(Run.number.between(1000, 2000)\ .filter(((ConditionType.name == "event_count") & (Condition.int_value > 10000)) | ((ConditionType.name == "data_value") & (Condition.float_value.between(1.2, 2.4))))\ .order_by(Run.number) print query.all()
Note that instead of common && and ||, & and | is used.
SQLAlchemy overloads this operators to use for comparison.
Note also, that such expressions should be in parentheses. It is possible to use or_ and and_ functions instead, but it doesn't improve the readability.
Querying using RCDB helpers
RCDB ConditionType provide helpful properties to make querying easier.
# get condition type t = db.get_condition_type("event_count") # select runs where event_count > 1000 query = t.run_query.filter(t.value_field > 1000) print query.all()
What happened?
- run_query - returns query bootstrap that selects Run objects for given type. So it hides this thing from the raw query above:
....query(Run).join(Run.conditions).join(Condition.type) ... .filter(((ConditionType.name == "event_count")
- value_field - returns the right Condition.xxx_value for a given type. When you put t.value_field > 1000 here, ConditionType t looked at his value_type and selected the right Condition.int_value to compare
But there is a limitation. Each condition type should has its own query. But queries can be combined by union or
intersect methods later.
Lets look at the example, where we fill DB with dummy data and then query for runs using the helper properties. The same example can be found in $RCDB_HOME/python/example_conditions_query.py
# create in memory SQLite database db = rcdb.RCDBProvider("sqlite://") rcdb.model.Base.metadata.create_all(db.engine) # create conditions types event_count_type = db.create_condition_type("event_count", ConditionType.INT_FIELD, False) data_value_type = db.create_condition_type("data_value", ConditionType.FLOAT_FIELD, False) # create runs and fill values for i in range(0, 100): db.create_run(i) db.add_condition(i, event_count_type, i + 950) #event_count in range 950 - 1049 db.add_condition(i, data_value_type, (i/100.0) + 1) #data_value in 1 - 2 """ Demonstrates ConditionType query helpers""" event_count_type = db.get_condition_type("event_count") data_value_type = db.get_condition_type("data_value") # select runs where event_count > 1000 query = event_count_type.run_query.filter(event_count_type.value_field > 1000).filter(Run.number <=53) print query.all() # select runs where 1.52 < data_value < 1.7 query2 = data_value_type.run_query .filter(data_value_type.value_field.between(1.52, 1.7))\ .filter(Run.number < 55) print query2.all() # combine results of this two queries print "Results intersect:" print query.intersect(query2).all() print "Results union:" print query.union(query2).all()
The output is:
[<Run number='51'>, <Run number='52'>, <Run number='53'>] [<Run number='52'>, <Run number='53'>, <Run number='54'>] Results intersect: [<Run number='52'>, <Run number='53'>] Results union: [<Run number='51'>, <Run number='52'>, <Run number='53'>, <Run number='54'>]
More on SQLAlchemy queries in
SQLAlchemy querying tutorial
SQLAlchemy Query API
The example is available as
python $RCDB_HOME/python/example_conditions_query.py
(It creates inmemory database so there is no need in creaty_empty_sqlite.py)
Logging
RCDB have a logging system which stores some information about what is going on in the same database in *'log_records'* table.
Set RCDB_USER environment variable to have your name in logs (or set it manually in API as shown below)
- Creating condition types goes to log automatically
- All condition values manipulations are not logged
It is done in assumption, that the database has many runs and each run has many condition values, so if each condition value creation will have text log message, the database will be bloated with log records.
From the other point of view, when you do a series of operations with conditions it may be a good idea to left a
log message that could be seen by other users.
Custom data modification by SQLAlchemy, like creating or deleting objects manually with session.commit() is not
logged too, so log notification is left to user here too.
How to left a log record:
# set RCDB_USER environment variable to give RCDB you user name # another option is to give it in constructor db = RCDBProvider("sqlite:///example.db", user_name="john") # and one more option of setting user name db.user_name = "john" # simplest log version db.add_log_record(None, "Hello everybody! You'll see this message in logs on RCDB site", 0)
First None means there is no specific database object ID for this message. The last '0' means there is no specific run number for this message
Performance
Reusing objects
Most of the API functions (like add_condition(...)
or get_condition(...)
) can accept model objects as
parameters:
# 1. Using run number and condition name db.add_condition(1, "my_value", 10) # 2. Using model objects run = db.get_run(1) ct = db.get_condition_type("my_value") db.add_condition(run, ct, 10)
When you do db.add_condition(1, "my_value", 10)
condition type and run are queried inside a function. If you do several actions with one object, like adding many conditions for one run or adding one condition to many runs, reusing the object could boost performance up to 30% each.
Auto commit value addition
Performance study shows, that approximately 50% of the time spent in add_condition(...)
is used to commit changes to DB.
To speed up conditions addition add_condition(...)
function has auto_commit optional argument.
By default it is True, changes are committed to DB, if add_condition call is successful.
Setting auto_commit=False allows to defer commit, changes are pending in SQLAlchemy cache and can be committed
manually later.
auto_commit=False purposes are:
- Make a lot of changes and commit them at one time gaining performance
- Rollback changes
To commit changes, having db = RCDBProvider(...)
you should call db.session.commit()
""" Test auto_commit feature that allows to commit changes to DB later""" ct = self.db.create_condition_type("ac", ConditionType.INT_FIELD, False) # Add condition to addition but don't commit changes self.db.add_condition(1, ct, 10, auto_commit=False) # But the object is selectable already val = self.db.get_condition(1, ct) self.assertEqual(val.value, 10) # Commit session. Now "ac"=10 is stored in the DB self.db.session.commit() # Now we deffer committing changes to DB. Object is in SQLAlchemy cache self.db.add_condition(1, ct, 20, None, True, False) self.db.add_condition(1, ct, 30, None, True, False) # If we select this object, SQLAlchemy gives us changed version val = self.db.get_condition(1, ct) self.assertEqual(val.value, 30) # Roll back changes self.db.session.rollback() val = self.db.get_condition(1, ct) self.assertEqual(val.value, 10)
The example is available in tests:
$RCDB_HOME/python/tests/test_conditions.py
(!) note at the same time, that more complex scenarios with not committed objects haven't been tested.
Command line tools
While ccdb like shell is still in progress, you can introspect and manipulate with run conditions using rcnd tool. The tool is added to the PATH after environment.bash(csh) from RCDB_HOME folder is sourced. It is, actually, placed in the same place as the environment.bash.
(!) rcnd doesn't offer all possible data manipulations
> export RCDB_CONNECTION=mysql://rcdb@localhost/rcdb > rcnd --help # Gives you self descriptive help > rcnd -c mysql://rcdb@localhost/rcdb # -c flag sets connection string from command line > rcnd # Gives database statistics, number of runs and conditions
Output
Runs total: 1387 Last run : 2472 Condition types total: 9 Conditions: components component_stats ...
Getting condition names and info
To get all conditions -l or --list flags are to be used. It shows condition names, types and descriptions (if exists):
> rcnd -l components (json) component_stats (json) event_count (int) - Run events count event_rate (float) - Events per sec. ...
To get names only use --list-names:
> rcnd --list-names components component_stats event_count event_rate ...
Getting value by the run number
To see all conditions and values for a run:
> rcnd 1000 # See all recorded values for run 1000 components = (json){"ROCBCAL2": "ROC", "ROCBCAL3": "ROC", "ROCBCAL1":... component_stats = (json){"ROCBCAL2": {"evt-number": 487, "data-rate": 300.... event_count = 487 rtvs = (json){"%(CODA_ROL1)": "/home/hdops/CDAQ/daq_dev_v0.31/d... run_config = 'pulser.conf' run_type = 'hd_bcal_n.ti' ...
Add name to get value of the only condition:
> rcnd 1000 event_count 487 > rcnd 1000 components {"ROCBCAL2": "ROC", "ROCBCAL3": "ROC"}
Writing data
Creating condition type (need to be done once):
> rcnd --create my_value --type string --description "This is my value" ConditionType created with name='my_value', type='string', is_many_per_run='False'
Where --type is:
- bool, int, float, string - basic types. float is the default
- json - to store arrays or custom objects
- time - to store just time. (You can alwais add time information to any other type)
- blob - binary blob. Don't use it if possible
Names policy (not strict at all):
- Don't use spaces. Use '_' instead
- Full words are better. So 'event_count' is better than evt_cnt
- Max name is 255 character. But please, make them shorter
Write value for run 1000 for condition 'my_value'
> rcnd --write "value to write" --replace 1000 my_value Written 'my_value' to run number 1000
Without --replace error is raised, if run 1000 already have different value for 'my_value'
Nightly generated SQLite files
An SQLite database is created at midnight. The latest version is available for download at https://halldweb.jlab.org/dist/rcdb.sqlite . Older versions of the RCDB can be found at JLab in /group/halld/Software/calib/rcdb_sqlite . See the section below on archiving old versions for more details.
The database is created by a cron job runing nightly from the "gluex" account on jlabl1.jlab.org. The following script is used:
/home/gluex/bin/rcdb_sqlite_create.sh
This script is kept in the subversion repository here:
https://halldsvn.jlab.org/repos/trunk/home/gluex/bin
The cron script in turn invokes a script that comes with CCDB with the following command:
$CCDB_HOME/scripts/mysql2sqlite/mysql2sqlite.sh -hhalldweb1 -urcdb rcdb | sqlite3 $sqlite_file
where the $sqlite_file variable is set in the rcdb_sqlite_create.sh script. The setting is based on the current date.
Archive of SQLite Files
The older versions kept at JLab in /group/halld/Software/calib/rcdb_sqlite are classified and automatically deleted as follows:
Classification | Day of the Month | Deletion Time |
---|---|---|
Monthly | 7th | one year after last access |
Weekly | 7th, 14th, 21st, 28th | one month after last access |
Daily | every day | one week after last access |
Support
Dmitry Romanov <romanov@jlab.org>
DescriptionDescription of how to manage RCDB run conditions using python API