Data Manipulation

PyBNesian implements common dataset manipulation techniques, such as k-fold cross-validation and holdout validation.

DataFrame

Internally, PyBNesian uses a pyarrow.RecordBatch to enable a zero-copy data exchange between C++ and Python.

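Most classes and methods take a DataFrame as an argument or return one. This type is an encapsulation of pyarrow.RecordBatch. As the examples in this section show, a pandas.DataFrame can be passed wherever a DataFrame is expected, and a returned DataFrame can be converted with to_pandas().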

DataFrame Operations

class pybnesian.CrossValidation

This class implements k-fold cross-validation, i.e. it splits the data into k disjoint folds. Each fold serves once as the test data, while the remaining rows form the training data.

__init__(self: pybnesian.CrossValidation, df: DataFrame, k: int = 10, seed: Optional[int] = None, include_null: bool = False) None

This constructor takes a DataFrame and creates a k-fold cross-validation of it. The data is shuffled before the folds are built.

Parameters
  • df – A DataFrame.

  • k – Number of folds.

  • seed – A random seed number. If not specified or None, a random seed is generated.

  • include_null – Whether to include the rows where some columns are null (missing). If False, rows with any missing value are filtered out before performing the cross-validation. Otherwise, all rows are included.

Raises

ValueError – If k is greater than the number of rows.
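
The shuffling and splitting described above can be illustrated with a small pure-Python sketch. It is not PyBNesian's implementation: the helper name kfold_indices and the way leftover rows are distributed among the folds are assumptions made only for illustration.

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(
    n_rows: int, k: int = 10, seed: Optional[int] = None
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    if k > n_rows:
        raise ValueError("k cannot be greater than the number of rows.")

    # Shuffle the row indices once, before any fold is built.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Assumption of this sketch: leftover rows go to the first folds.
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 20 rows and 10 folds: every fold has 18 training rows and 2 test rows.
folds = list(kfold_indices(20, k=10, seed=0))
```

Because the test folds are disjoint and together cover every row, each row is used for testing exactly once.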

__iter__(self: pybnesian.CrossValidation) Iterator

Iterates over the k-fold cross-validation.

Returns

The iterator returns a tuple (DataFrame, DataFrame) which contains the training data and test data of each fold.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_data, test_data) in CrossValidation(df):
...     assert training_data.num_rows == 18
...     assert test_data.num_rows == 2

fold(self: pybnesian.CrossValidation, index: int) Tuple[DataFrame, DataFrame]

Returns the index-th fold.

Parameters

index – Fold index.

Returns

A tuple (DataFrame, DataFrame) which contains the training data and test data of the index-th fold.

indices(self: pybnesian.CrossValidation) Iterator

Iterates over the row indices of each training and test DataFrame.

Returns

A tuple (list, list) containing the row indices (with respect to the original DataFrame) of the train and test data of each fold.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_indices, test_indices) in CrossValidation(df).indices():
...     assert set(range(20)) == set(list(training_indices) + list(test_indices))

loc(self: pybnesian.CrossValidation, columns: str or int or List[str] or List[int]) CrossValidation

Selects columns from the CrossValidation object.

Parameters

columns – Columns to select. The columns can be represented by their index (int or List[int]) or by their name (str or List[str]).

Returns

A CrossValidation object with the selected columns.

class pybnesian.HoldOut

This class implements holdout validation, i.e. it splits the data into training and test sets.

__init__(self: pybnesian.HoldOut, df: DataFrame, test_ratio: float = 0.2, seed: Optional[int] = None, include_null: bool = False) None

This constructor takes a DataFrame and splits it into training and test sets. It shuffles the data before applying the holdout.

Parameters
  • df – A DataFrame.

  • test_ratio – Proportion of instances left for the test data.

  • seed – A random seed number. If not specified or None, a random seed is generated.

  • include_null – Whether to include the rows where some columns are null (missing). If False, rows with any missing value are filtered out before performing the holdout. Otherwise, all rows are included.
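
The effect of test_ratio and include_null can be sketched in pure Python as follows. This is only an illustration, not PyBNesian's implementation: the helper holdout_split, the representation of rows as dictionaries, and the rounding of the test set size are all assumptions of the sketch.

```python
import random
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]


def holdout_split(
    rows: List[Row],
    test_ratio: float = 0.2,
    seed: Optional[int] = None,
    include_null: bool = False,
) -> Tuple[List[Row], List[Row]]:
    """Return (training_rows, test_rows) after shuffling."""
    if not include_null:
        # Drop every row that has at least one missing value.
        rows = [r for r in rows if all(v is not None for v in r.values())]

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    # Assumption of this sketch: the test size is rounded to the nearest int.
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# 11 rows, one of them with a missing value in column "b".
rows = [{"a": float(i), "b": None if i == 3 else float(10 * i)} for i in range(11)]
training_rows, test_rows = holdout_split(rows, test_ratio=0.2, seed=0)
```

With the row containing a missing value filtered out, 10 rows remain, so this sketch leaves 8 rows for training and 2 for testing.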

test_data(self: pybnesian.HoldOut) DataFrame

Gets the test data.

Returns

Test data.

training_data(self: pybnesian.HoldOut) DataFrame

Gets the training data.

Returns

Training data.

Dynamic Data

class pybnesian.DynamicDataFrame

This class adapts a static DataFrame to a dynamic context (time series), which makes it easier to learn dynamic Bayesian networks.

A DynamicDataFrame creates columns with different temporal delays from the data in the static DataFrame. Each column in the DynamicDataFrame is named with the pattern [variable_name]_t_[temporal_index], where variable_name is the name of a column in the static DataFrame and temporal_index is an integer in the range [0, markovian_order]. Index 0 is the “present”, index 1 is delayed one step into the “past”, and so on.

DynamicDataFrame provides two methods, DynamicDataFrame.static_df() and DynamicDataFrame.transition_df(), that can be used to learn the static Bayesian network and transition Bayesian network components of a dynamic Bayesian network.

All the operations are implemented using a zero-copy strategy to avoid wasting memory.
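
The column layout described above can be reproduced with a short pure-Python sketch. The helpers transition_columns and static_columns are hypothetical and only mimic the naming pattern and the row alignment; the column ordering for several variables is an assumption, and the list slices copy data, unlike the zero-copy strategy used by the library.

```python
from typing import Dict, List

Columns = Dict[str, List[float]]


def transition_columns(data: Columns, markovian_order: int) -> Columns:
    """Build [variable_name]_t_0 ... [variable_name]_t_[markovian_order]."""
    out: Columns = {}
    for i in range(markovian_order + 1):
        for name, values in data.items():
            n = len(values)
            # Slice i holds the series delayed i steps into the past.
            out[f"{name}_t_{i}"] = values[markovian_order - i:n - i]
    return out


def static_columns(data: Columns, markovian_order: int) -> Columns:
    """Build [variable_name]_t_1 ... [variable_name]_t_[markovian_order]."""
    out: Columns = {}
    for i in range(1, markovian_order + 1):
        for name, values in data.items():
            n = len(values)
            # Without slice 0, one more row is available than above.
            out[f"{name}_t_{i}"] = values[markovian_order - i:n - i + 1]
    return out


data = {"a": [float(v) for v in range(10)]}
transition = transition_columns(data, 2)
static = static_columns(data, 2)
```

For a single variable a with values 0.0 to 9.0 and markovian order 2, transition["a_t_0"] starts at 2.0 and transition["a_t_2"] starts at 0.0, with 8 rows each, while the static columns have 9 rows.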

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.transition_df().to_pandas()
   a_t_0  a_t_1  a_t_2
0    2.0    1.0    0.0
1    3.0    2.0    1.0
2    4.0    3.0    2.0
3    5.0    4.0    3.0
4    6.0    5.0    4.0
5    7.0    6.0    5.0
6    8.0    7.0    6.0
7    9.0    8.0    7.0
>>> ddf.static_df().to_pandas()
   a_t_1  a_t_2
0    1.0    0.0
1    2.0    1.0
2    3.0    2.0
3    4.0    3.0
4    5.0    4.0
5    6.0    5.0
6    7.0    6.0
7    8.0    7.0
8    9.0    8.0

__init__(self: pybnesian.DynamicDataFrame, df: DataFrame, markovian_order: int) None

Creates a DynamicDataFrame from a static DataFrame using a given markovian order.

Parameters
  • df – A DataFrame.

  • markovian_order – Markovian order of the transformation.

loc(self: pybnesian.DynamicDataFrame, columns: DynamicVariable or List[DynamicVariable]) DataFrame

Gets a column or set of columns from the DynamicDataFrame. See DynamicVariable.

Returns

A DataFrame with the selected columns.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float),
...                    'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.loc(("b", 1)).to_pandas()
   b_t_1
0   10.0
1   20.0
2   30.0
3   40.0
4   50.0
5   60.0
6   70.0
7   80.0
>>> ddf.loc([("a", 0), ("b", 1)]).to_pandas()
   a_t_0  b_t_1
0    2.0   10.0
1    3.0   20.0
2    4.0   30.0
3    5.0   40.0
4    6.0   50.0
5    7.0   60.0
6    8.0   70.0
7    9.0   80.0

All the DynamicVariables in a list must identify their variable in the same way, either all by name or all by index, so do not mix the two:

>>> ddf.loc([(0, 0), ("b", 1)]) # do NOT do this!

>>> # Use either names or indices:
>>> ddf.loc([("a", 0), ("b", 1)]) # GOOD
>>> ddf.loc([(0, 1), (1, 1)]) # GOOD

markovian_order(self: pybnesian.DynamicDataFrame) int

Gets the markovian order.

Returns

Markovian order of the DynamicDataFrame.

num_columns(self: pybnesian.DynamicDataFrame) int

Gets the number of columns.

Returns

The number of columns. This is equal to the number of columns of DynamicDataFrame.transition_df().

num_rows(self: pybnesian.DynamicDataFrame) int

Gets the number of rows.

Returns

Number of rows.

num_variables(self: pybnesian.DynamicDataFrame) int

Gets the number of variables.

Returns

The number of variables. This is exactly equal to the number of columns in DynamicDataFrame.origin_df().

origin_df(self: pybnesian.DynamicDataFrame) DataFrame

Gets the original DataFrame.

Returns

The DataFrame passed to the constructor of DynamicDataFrame.

static_df(self: pybnesian.DynamicDataFrame) DataFrame

Gets the DataFrame for the static Bayesian network. The static network estimates the probability f(t_1,…, t_[markovian_order]). See DynamicDataFrame example.

Returns

A DataFrame with columns from [variable_name]_t_1 to [variable_name]_t_[markovian_order].

temporal_slice(self: pybnesian.DynamicDataFrame, indices: int or List[int]) DataFrame

Gets a temporal slice or a set of temporal slices. The i-th temporal slice is composed of the columns [variable_name]_t_i.

Returns

A DataFrame with the selected temporal slices.
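
As a rough illustration of which columns make up a temporal slice, the following pure-Python sketch builds the column names for one or several slice indices. The helper temporal_slice_columns is hypothetical and not part of PyBNesian; the slice-by-slice ordering is an assumption based on the output of the example in this section.

```python
from typing import List, Union


def temporal_slice_columns(
    variables: List[str], indices: Union[int, List[int]]
) -> List[str]:
    """Return the column names that belong to the requested slices."""
    if isinstance(indices, int):
        indices = [indices]
    # All the variables of one slice come before those of the next slice.
    return [f"{name}_t_{i}" for i in indices for name in variables]


single = temporal_slice_columns(["a", "b"], 1)
multiple = temporal_slice_columns(["a", "b"], [0, 2])
```

Here single contains a_t_1 and b_t_1, and multiple contains a_t_0, b_t_0, a_t_2 and b_t_2.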

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float), 'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.temporal_slice(1).to_pandas()
   a_t_1  b_t_1
0    1.0   10.0
1    2.0   20.0
2    3.0   30.0
3    4.0   40.0
4    5.0   50.0
5    6.0   60.0
6    7.0   70.0
7    8.0   80.0
>>> ddf.temporal_slice([0, 2]).to_pandas()
   a_t_0  b_t_0  a_t_2  b_t_2
0    2.0   20.0    0.0    0.0
1    3.0   30.0    1.0   10.0
2    4.0   40.0    2.0   20.0
3    5.0   50.0    3.0   30.0
4    6.0   60.0    4.0   40.0
5    7.0   70.0    5.0   50.0
6    8.0   80.0    6.0   60.0
7    9.0   90.0    7.0   70.0

transition_df(self: pybnesian.DynamicDataFrame) DataFrame

Gets the DataFrame for the transition Bayesian network. The transition network estimates the conditional probability f(t_0 | t_1, …, t_[markovian_order]). See DynamicDataFrame example.

Returns

A DataFrame with columns from [variable_name]_t_0 to [variable_name]_t_[markovian_order].

class DynamicVariable

A DynamicVariable is the representation of a column in a DynamicDataFrame.

A DynamicVariable is a tuple (variable_index, temporal_index). variable_index is a str or int that represents the name or index of the variable in the original static DataFrame. temporal_index is an int that represents the temporal slice in the DynamicDataFrame. See DynamicDataFrame.loc for usage examples.
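
The mapping from a DynamicVariable tuple to a column name can be sketched in pure Python. The helpers column_name and column_names are hypothetical, and raising TypeError when names and indices are mixed is an assumption of this sketch; it is not taken from the library, whose behavior in that case is not documented here.

```python
from typing import List, Tuple, Union

DynamicVariable = Tuple[Union[str, int], int]


def column_name(variable: DynamicVariable, variables: List[str]) -> str:
    """Translate (variable_index, temporal_index) into a column name."""
    variable_index, temporal_index = variable
    if isinstance(variable_index, int):
        # An int refers to the position of the variable in the static data.
        name = variables[variable_index]
    else:
        name = variable_index
        if name not in variables:
            raise KeyError(name)
    return f"{name}_t_{temporal_index}"


def column_names(
    selection: List[DynamicVariable], variables: List[str]
) -> List[str]:
    """Translate a list of DynamicVariables, rejecting mixed kinds."""
    kinds = {type(variable_index) for variable_index, _ in selection}
    if len(kinds) > 1:
        # Assumption of this sketch: mixing names and indices is an error.
        raise TypeError("Do not mix names and indices in the same list.")
    return [column_name(v, variables) for v in selection]


by_name = column_names([("a", 0), ("b", 1)], ["a", "b"])
by_index = column_names([(0, 1), (1, 1)], ["a", "b"])
```

In this sketch, by_name refers to the columns a_t_0 and b_t_1, and by_index refers to a_t_1 and b_t_1.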