[uproot] Working with ROOT files easily in Python

A short note on handling ROOT files from Python without using the official ROOT bindings (PyROOT).

Contents

1 Background
2 uproot
3 Installation
4 TFile
5 TTree
- 5.1 Converting to pandas
- 5.2 Concatenating multiple files ~ TChain

Background

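It is increasingly common for online data acquisition to be done in ROOT/C++ while offline analysis is done in Python.
For data processing in Python, data in pandas/numpy form is the easiest to work with.
There used to be root_pandas, and root_numpy on which it is based,
but when I glanced at them some time ago they did not feel very intuitive, at least to me,
and both now appear to have stalled and to be marked DEPRECATED/UNMAINTAINED.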

This post introduces an approach based on uproot instead.

uproot

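uproot is one of the packages published by the Scikit-HEP project.
Scikit-HEP aims to build a framework for doing particle physics data analysis seamlessly in Python.
Interfacing with ROOT is an inseparable part of that goal, and uproot appears to be the piece that covers it.
# There are many other interesting projects as well, such as iminuit and GooFit.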

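uproot is a library specialized in I/O for ROOT files and the ROOT objects they contain.
It works on its own, without the official ROOT (PyROOT),
and makes it relatively easy to move data into numpy/pandas.
The uproot project describes itself as follows.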

Uproot is a library for reading and writing ROOT files in pure Python and NumPy.

Unlike the standard C++ ROOT implementation, Uproot is only an I/O library, primarily intended to stream data into machine learning libraries in Python. Unlike PyROOT and root_numpy, Uproot does not depend on C++ ROOT. Instead, it uses Numpy to cast blocks of data from the ROOT file as Numpy arrays.

Installation

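$ conda install uproot (with miniforge/conda-forge) is all it takes to install it.
ROOT itself can in fact also be installed from conda, but it takes up several GB.
The uproot tarball, by contrast, is about 220 kB(?) and only about 3.3 MB(?) once unpacked,
so it is quick and painless to set up.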

For usage, see Getting Started in the official documentation.
What follows is mainly an excerpt of the parts needed to work with a TTree.

TFile

To read data, open the file as follows.

import uproot
file = uproot.open("data.root")

As is common practice in Python, opening the file with a with statement as below means you do not have to worry about closing it.

with uproot.open("data.root") as file:
    # do something with 'file'; it is closed automatically when the block ends
    print(file.classnames())

The closest equivalent to TFile::ls() is the following.

[in]: file.classnames()

[out]: {'tr;1': 'TTree', ...}

This shows which ROOT objects are stored in the file and what they are called.
Stored objects can then be accessed by name.

[in]: file['tr']

[out]: <TTree 'tr' (N branches) at 0xXXXXXXXXXX>

Alternatively, if you already know the object's name, you can append it to the file name as follows.

[in]: uproot.open("data.root:tr")

[out]: <TTree 'tr' (N branches) at 0xXXXXXXXXXX>

TTree

To inspect the structure of a TTree, such as its branches, the closest equivalent to TTree::Print() is the following.

[in]:
tr=uproot.open("data.root:tr")
tr.show()

[out]:
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
unixtime             | uint32_t                 | AsDtype('>u4')
val                  | float[4]                 | AsDtype("('>f4', (4,))")

Alternatively, tr.typenames() may be enough.
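
The strings in the interpretation column are NumPy dtype specifications: '>u4' is a big-endian unsigned 32-bit integer, and ('>f4', (4,)) is a block of four big-endian 32-bit floats per entry. The sketch below uses only NumPy, with made-up bytes, to show what those two dtypes mean. It is an illustration of the notation, not of how uproot reads a file internally.

```python
import numpy as np

# Four bytes in big-endian order, as an integer would be stored on disk.
raw = bytes([0x00, 0x00, 0x00, 0x2A])
value = np.frombuffer(raw, dtype=">u4")   # big-endian unsigned 32-bit
native = value.astype(np.uint32)          # same number in native byte order

# The dtype reported for a float[4] branch: four big-endian float32 per entry.
entry_dtype = np.dtype((">f4", (4,)))

# One made-up entry serialized as big-endian float32, then decoded again.
raw_entry = np.array([1.0, 2.0, 3.0, 4.0], dtype=">f4").tobytes()
entry = np.frombuffer(raw_entry, dtype=">f4").reshape(-1, 4)

print(int(native[0]), entry_dtype.itemsize, entry.shape)
```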

Converting to pandas

To work with the data as a pandas.DataFrame, do the following.

[in]: tr.arrays(["unixtime","val"], library="pd")

[out]:
         unixtime      val[0]      val[1]      val[2]      val[3]
0      163xxxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx
...           ...         ...         ...         ...         ...
n      163xxxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx

[n rows x 5 columns]

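Besides pandas, it also seems possible to receive the data as a dictionary of numpy arrays keyed by branch name,
or to read a single branch on its own as a numpy array.
Reading only selected branches is much like restricting what TTree::GetEntry(...) reads by calling TTree::SetBranchStatus("*",0) and so on.
Unless memory usage or processing time becomes a real concern, though, I think it is fine to read into a pandas.DataFrame as shown above and work from there.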

[in]: tr.arrays(['unixtime','val'],library='np')

[out]:
{'unixtime': array([163xxxxxxx, ..., 163xxxxxxx], dtype=uint32),
 'val': array([[xxx.xxxxxx, xxx.xxxxxx,xxx.xxxxxx,xxx.xxxxxx], ...,
       [xxx.xxxxxx, xxx.xxxxxx,xxx.xxxxxx,xxx.xxxxxx]], dtype=float32)}

[in]: tr['unixtime'].array(library='np')

[out]: array([163xxxxxxx, ..., 163xxxxxxx], dtype=uint32)
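
To relate the two output forms, here is a minimal sketch that turns a dictionary like the one returned with library='np' into a DataFrame with the same val[0] ... val[3] column naming as the pandas output shown earlier. The arrays dictionary is a hand-made stand-in with invented values, and the flattening loop is my own illustration, not something uproot provides.

```python
import numpy as np
import pandas as pd

# Stand-in for tr.arrays(['unixtime', 'val'], library='np'); values are made up.
arrays = {
    "unixtime": np.array([1630000000, 1630000001, 1630000002], dtype=np.uint32),
    "val": np.arange(12, dtype=np.float32).reshape(3, 4),
}

columns = {}
for name, arr in arrays.items():
    if arr.ndim == 1:
        # Scalar branch: one column.
        columns[name] = arr
    else:
        # Fixed-size array branch: one column per element, named name[i].
        for i in range(arr.shape[1]):
            columns[f"{name}[{i}]"] = arr[:, i]

df = pd.DataFrame(columns)
print(df)
```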

Concatenating multiple files ~ TChain

This corresponds to using a TChain to chain together TTrees with the same structure that are spread across multiple files.

[in]: uproot.concatenate(["data1.root:tr","data2.root:tr"],filter_name="*",library="pd")

[out]:
         unixtime      val[0]      val[1]      val[2]      val[3]
0      163xxxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx
...           ...         ...         ...         ...         ...
N      164xxxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx  xxx.xxxxxx

[N rows x 5 columns]
