Data Manipulation¶

The pybnesian.dataset module implements common dataset manipulation techniques, such as k-fold cross-validation and holdout.

DataFrame¶

Internally, PyBNesian uses a pyarrow.RecordBatch to enable a zero-copy data exchange between C++ and Python.

Most classes and methods take a DataFrame as an argument or return one. This type is an encapsulation of pyarrow.RecordBatch:

When a function takes a DataFrame as an argument, either a pyarrow.RecordBatch or a pandas.DataFrame can be passed.
When PyBNesian specifies a DataFrame return type, a pyarrow.RecordBatch is returned. It can be converted to a pandas.DataFrame with pyarrow.RecordBatch.to_pandas().

DataFrame Operations¶

class pybnesian.dataset.CrossValidation¶

This class implements k-fold cross-validation, i.e. it splits the data into k disjoint folds. Each fold is used once as test data, while the remaining folds form the training data.

__init__(self: pybnesian.dataset.CrossValidation, df: DataFrame, k: int = 10, seed: Optional[int] = None, include_null: bool = False) → None ¶

This constructor takes a DataFrame and creates a k-fold cross-validation of it. The data is shuffled before the folds are created.

Parameters

df – A DataFrame.
k – Number of folds.
seed – A random seed number. If not specified or None, a random seed is generated.
include_null – Whether to include the rows where some columns are null (missing). If False, rows with any missing value are filtered out before performing the cross-validation. Otherwise, all the rows are included.

Raises

ValueError – If k is greater than the number of rows.
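
The shuffle-then-split behaviour described above can be sketched with the standard library alone. This is an illustration, not PyBNesian's implementation: the helper name kfold_indices is hypothetical, and the rule for distributing leftover rows among folds is an assumption.

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(num_rows: int, k: int = 10,
                  seed: Optional[int] = None) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds.

    Hypothetical helper for illustration; not part of pybnesian.
    """
    if k > num_rows:
        # Mirrors the documented ValueError when k exceeds the number of rows.
        raise ValueError("k cannot be greater than the number of rows.")

    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # shuffle before splitting

    # Assumption: the first (num_rows % k) folds receive one extra row.
    base, extra = divmod(num_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test


folds = list(kfold_indices(20, k=10, seed=0))
```

With 20 rows and k = 10, every fold in this sketch holds 18 training indices and 2 test indices, and the test folds together cover each row exactly once.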

__iter__(self: pybnesian.dataset.CrossValidation) → Iterator¶

Iterates over the k-fold cross-validation.

Returns: The iterator returns a tuple (DataFrame, DataFrame) which contains the training data and test data of each fold.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_data, test_data) in CrossValidation(df):
...     assert training_data.num_rows == 18
...     assert test_data.num_rows == 2

fold(self: pybnesian.dataset.CrossValidation, index: int) → Tuple[DataFrame, DataFrame]¶

Returns the index-th fold.

Parameters: index – Fold index.
Returns: A tuple (DataFrame, DataFrame) which contains the training data and test data of the index-th fold.

indices(self: pybnesian.dataset.CrossValidation) → Iterator¶

Iterates over the row indices of each training and test DataFrame.

Returns: A tuple (list, list) containing the row indices (with respect to the original DataFrame) of the train and test data of each fold.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_indices, test_indices) in CrossValidation(df).indices():
...     assert set(range(20)) == set(list(training_indices) + list(test_indices))

loc(self: pybnesian.dataset.CrossValidation, columns: str or int or List[str] or List[int]) → CrossValidation ¶

Selects columns from the CrossValidation object.

Parameters: columns – Columns to select. The columns can be represented by their index (int or List[int]) or by their name (str or List[str]).
Returns: A CrossValidation object with the selected columns.

class pybnesian.dataset.HoldOut¶

This class implements holdout validation, i.e. it splits the data into training and test sets.

__init__(self: pybnesian.dataset.HoldOut, df: DataFrame, test_ratio: float = 0.2, seed: Optional[int] = None, include_null: bool = False) → None ¶

This constructor takes a DataFrame and splits it into training and test sets. The data is shuffled before the holdout split is applied.

Parameters

df – A DataFrame.
test_ratio – Proportion of instances left for the test data.
seed – A random seed number. If not specified or None, a random seed is generated.
include_null – Whether to include the rows where some columns are null (missing). If False, rows with any missing value are filtered out before performing the holdout split. Otherwise, all the rows are included.
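
As a rough sketch of how these parameters interact, the following standard-library code filters rows with missing values, shuffles, and splits by test_ratio. It is not PyBNesian's implementation: holdout_split is a hypothetical helper, None stands in for a null value, and the rounding of the test set size is an assumption.

```python
import random
from typing import List, Optional, Sequence, Tuple


def holdout_split(rows: Sequence[tuple], test_ratio: float = 0.2,
                  seed: Optional[int] = None,
                  include_null: bool = False) -> Tuple[List[tuple], List[tuple]]:
    """Return (training_rows, test_rows).

    Hypothetical helper for illustration; not part of pybnesian.
    """
    if include_null:
        rows = list(rows)
    else:
        # Drop every row that has at least one missing value.
        rows = [row for row in rows if all(v is not None for v in row)]

    random.Random(seed).shuffle(rows)  # shuffle before splitting

    # Assumption: the test set size is rounded to the nearest integer.
    num_test = round(len(rows) * test_ratio)
    return rows[num_test:], rows[:num_test]


# 25 rows, 5 of which have a missing value in the second column.
data = [(float(i), None if i % 5 == 0 else float(i * 10)) for i in range(25)]
training, test = holdout_split(data, test_ratio=0.2, seed=0)
```

Here the 5 incomplete rows are dropped first, so the ratio applies to the 20 remaining rows rather than to the original 25.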

test_data(self: pybnesian.dataset.HoldOut) → DataFrame¶

Gets the test data.

Returns: Test data.

training_data(self: pybnesian.dataset.HoldOut) → DataFrame¶

Gets the training data.

Returns: Training data.

Dynamic Data¶

class pybnesian.dataset.DynamicDataFrame¶

This class implements the adaptation of a DataFrame to a dynamic context (temporal series). This makes it easier to learn dynamic Bayesian networks.

A DynamicDataFrame creates columns with different temporal delays from the data in the static DataFrame. Each column in the DynamicDataFrame is named with the following pattern: [variable_name]_t_[temporal_index]. The variable_name is the name of each column in the static DataFrame. The temporal_index is an integer in the range [0, markovian_order]. The index “0” is considered the “present”, the index “1” is delayed one step into the “past”, and so on.

DynamicDataFrame provides two methods, DynamicDataFrame.static_df() and DynamicDataFrame.transition_df(), that can be used to learn the static Bayesian network and transition Bayesian network components of a dynamic Bayesian network.

All the operations are implemented using a zero-copy strategy to avoid wasting memory.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.transition_df().to_pandas()
   a_t_0  a_t_1  a_t_2
0    2.0    1.0    0.0
1    3.0    2.0    1.0
2    4.0    3.0    2.0
3    5.0    4.0    3.0
4    6.0    5.0    4.0
5    7.0    6.0    5.0
6    8.0    7.0    6.0
7    9.0    8.0    7.0
>>> ddf.static_df().to_pandas()
   a_t_1  a_t_2
0    1.0    0.0
1    2.0    1.0
2    3.0    2.0
3    4.0    3.0
4    5.0    4.0
5    6.0    5.0
6    7.0    6.0
7    8.0    7.0
8    9.0    8.0
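
The delayed columns in the example above can be reproduced with plain list slicing. The sketch below is an illustration only: lagged_columns is a hypothetical helper, its row-count rule is inferred from the example output rather than taken from the library, and it copies data where the library is documented as zero-copy.

```python
from typing import Dict, List


def lagged_columns(name: str, values: List[float], markovian_order: int,
                   first_slice: int = 0) -> Dict[str, List[float]]:
    """Build the [name]_t_[i] columns for i in first_slice..markovian_order.

    Hypothetical helper for illustration; not part of pybnesian.
    first_slice=0 mimics the transition columns, first_slice=1 the static ones.
    """
    # Assumption inferred from the documented example: dropping slice 0
    # leaves room for one additional row.
    num_rows = len(values) - markovian_order + first_slice
    columns = {}
    for i in range(first_slice, markovian_order + 1):
        # Slice i is delayed i steps into the past relative to slice 0.
        offset = markovian_order - i
        columns[f"{name}_t_{i}"] = values[offset:offset + num_rows]
    return columns


series = [float(v) for v in range(10)]
transition = lagged_columns("a", series, markovian_order=2)
static = lagged_columns("a", series, markovian_order=2, first_slice=1)
```

For this input, transition holds 8 rows for a_t_0, a_t_1 and a_t_2, and static holds 9 rows for a_t_1 and a_t_2, matching the tables printed above.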

__init__(self: pybnesian.dataset.DynamicDataFrame, df: DataFrame, markovian_order: int) → None ¶

Creates a DynamicDataFrame from a static DataFrame using a given markovian order.

Parameters

df – A DataFrame.
markovian_order – Markovian order of the transformation.

loc(self: pybnesian.dataset.DynamicDataFrame, columns: DynamicVariable or List[DynamicVariable]) → DataFrame¶

Gets a column or set of columns from the DynamicDataFrame. See DynamicVariable.

Returns: A DataFrame with the selected columns.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float),
...                    'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.loc(("b", 1)).to_pandas()
   b_t_1
0   10.0
1   20.0
2   30.0
3   40.0
4   50.0
5   60.0
6   70.0
7   80.0
>>> ddf.loc([("a", 0), ("b", 1)]).to_pandas()
   a_t_0  b_t_1
0    2.0   10.0
1    3.0   20.0
2    4.0   30.0
3    5.0   40.0
4    6.0   50.0
5    7.0   60.0
6    8.0   70.0
7    9.0   80.0

All the DynamicVariables in the list must be of the same type, so do not mix different types:

>>> ddf.loc([(0, 0), ("b", 1)]) # do NOT do this!

# Use either names or indices:
>>> ddf.loc([("a", 0), ("b", 1)]) # GOOD
>>> ddf.loc([(0, 1), (1, 1)]) # GOOD

markovian_order(self: pybnesian.dataset.DynamicDataFrame) → int ¶

Gets the markovian order.

Returns: Markovian order of the DynamicDataFrame.

num_columns(self: pybnesian.dataset.DynamicDataFrame) → int ¶

Gets the number of columns.

Returns: The number of columns. This is equal to the number of columns of DynamicDataFrame.transition_df().

num_rows(self: pybnesian.dataset.DynamicDataFrame) → int ¶

Gets the number of rows.

Returns: Number of rows.

num_variables(self: pybnesian.dataset.DynamicDataFrame) → int ¶

Gets the number of variables.

Returns: The number of variables. This is exactly equal to the number of columns in DynamicDataFrame.origin_df().

origin_df(self: pybnesian.dataset.DynamicDataFrame) → DataFrame¶

Gets the original DataFrame.

Returns: The DataFrame passed to the constructor of DynamicDataFrame.

static_df(self: pybnesian.dataset.DynamicDataFrame) → DataFrame¶

Gets the DataFrame for the static Bayesian network. The static network estimates the joint probability f(t_1, …, t_[markovian_order]). See the DynamicDataFrame example.

Returns: A DataFrame with columns from [variable_name]_t_1 to [variable_name]_t_[markovian_order].

temporal_slice(self: pybnesian.dataset.DynamicDataFrame, indices: int or List[int]) → DataFrame¶

Gets a temporal slice or a set of temporal slices. The i-th temporal slice is composed of the columns [variable_name]_t_i.

Returns: A DataFrame with the selected temporal slices.

>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float), 'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.temporal_slice(1).to_pandas()
   a_t_1  b_t_1
0    1.0   10.0
1    2.0   20.0
2    3.0   30.0
3    4.0   40.0
4    5.0   50.0
5    6.0   60.0
6    7.0   70.0
7    8.0   80.0
>>> ddf.temporal_slice([0, 2]).to_pandas()
   a_t_0  b_t_0  a_t_2  b_t_2
0    2.0   20.0    0.0    0.0
1    3.0   30.0    1.0   10.0
2    4.0   40.0    2.0   20.0
3    5.0   50.0    3.0   30.0
4    6.0   60.0    4.0   40.0
5    7.0   70.0    5.0   50.0
6    8.0   80.0    6.0   60.0
7    9.0   90.0    7.0   70.0

transition_df(self: pybnesian.dataset.DynamicDataFrame) → DataFrame¶

Gets the DataFrame for the transition Bayesian network. The transition network estimates the conditional probability f(t_0 | t_1, …, t_[markovian_order]). See the DynamicDataFrame example.

Returns: A DataFrame with columns from [variable_name]_t_0 to [variable_name]_t_[markovian_order].

class pybnesian.dataset.DynamicVariable¶

A DynamicVariable is the representation of a column in a DynamicDataFrame.

A DynamicVariable is a tuple (variable_index, temporal_index). variable_index is a str or int that represents the name or index of the variable in the original static DataFrame. temporal_index is an int that represents the temporal slice in the DynamicDataFrame. See DynamicDataFrame.loc() for usage examples.
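
To make the tuple convention concrete, the sketch below maps (variable_index, temporal_index) pairs to column names of the form [variable_name]_t_[temporal_index]. It is only an illustration: column_name and column_names are hypothetical helpers, and raising TypeError for mixed names and indices is an assumption, since the error raised by the library is not stated here.

```python
from typing import List, Tuple, Union

# (variable_index, temporal_index), as described above.
DynVar = Tuple[Union[str, int], int]


def column_name(variable: DynVar, variable_names: List[str]) -> str:
    """Resolve one pair to its column name. Hypothetical helper, not part of pybnesian."""
    name_or_index, temporal_index = variable
    if isinstance(name_or_index, int):
        name = variable_names[name_or_index]  # positional lookup
    elif name_or_index in variable_names:
        name = name_or_index
    else:
        raise KeyError(f"Unknown variable: {name_or_index!r}")
    return f"{name}_t_{temporal_index}"


def column_names(variables: List[DynVar], variable_names: List[str]) -> List[str]:
    """Resolve a list of pairs, rejecting lists that mix names and indices."""
    kinds = {type(variable[0]) for variable in variables}
    if len(kinds) > 1:
        # Assumption: the exception type is chosen for this sketch only.
        raise TypeError("Do not mix names and indices in the same list.")
    return [column_name(variable, variable_names) for variable in variables]


names = ["a", "b"]
by_name = column_names([("a", 0), ("b", 1)], names)
by_index = column_names([(0, 1), (1, 1)], names)

try:
    column_names([(0, 0), ("b", 1)], names)
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```

In this sketch, by_name resolves to a_t_0 and b_t_1, by_index resolves to a_t_1 and b_t_1, and the mixed list is rejected.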