Data Manipulation
PyBNesian implements some useful dataset manipulation techniques, such as k-fold cross-validation and hold-out validation.
DataFrame
Internally, PyBNesian uses a pyarrow.RecordBatch to enable a zero-copy data exchange between C++ and Python.
Most of the classes and methods take a DataFrame as an argument or return one. This type is an
encapsulation of pyarrow.RecordBatch:
- When a DataFrame is taken as an argument in a function, either a pyarrow.RecordBatch or a pandas.DataFrame can be used as a parameter.
- When PyBNesian specifies a DataFrame return type, a pyarrow.RecordBatch is returned. This can be converted easily to a pandas.DataFrame using pyarrow.RecordBatch.to_pandas().
DataFrame Operations
- class pybnesian.CrossValidation
This class implements k-fold cross-validation, i.e. it splits the data into k disjoint sets of train and test data.
- __init__(self: pybnesian.CrossValidation, df: DataFrame, k: int = 10, seed: int | None = None, include_null: bool = False) None
This constructor takes a DataFrame and returns a k-fold cross-validation. It shuffles the data before applying the cross-validation.
- Parameters:
df – A DataFrame.
k – Number of folds.
seed – A random seed number. If not specified or None, a random seed is generated.
include_null – Whether to include the rows where some columns may be null (missing). If false, the rows with some missing values are filtered before performing the cross-validation. Else, all the rows are included.
- Raises:
ValueError – If k is greater than the number of rows.
- __iter__(self: pybnesian.CrossValidation) Iterator[tuple[DataFrame, DataFrame]]
Iterates over the k-fold cross-validation.
- Returns:
The iterator returns a tuple (DataFrame, DataFrame) which contains the training data and test data of each fold.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_data, test_data) in CrossValidation(df):
...     assert training_data.num_rows == 18
...     assert test_data.num_rows == 2
- fold(self: pybnesian.CrossValidation, index: int) tuple[DataFrame, DataFrame]
Returns the index-th fold.
- Parameters:
index – Fold index.
- Returns:
A tuple (DataFrame, DataFrame) which contains the training data and test data of the index-th fold.
- indices(self: pybnesian.CrossValidation) Iterator[tuple[list[int], list[int]]]
Iterates over the row indices of each training and test DataFrame.
- Returns:
A tuple (list, list) containing the row indices (with respect to the original DataFrame) of the train and test data of each fold.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_indices, test_indices) in CrossValidation(df).indices():
...     assert set(range(20)) == set(list(training_indices) + list(test_indices))
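The index bookkeeping behind a shuffled k-fold split can be sketched in plain Python. This is only an illustration of the technique, not PyBNesian's implementation: the function name kfold_indices is hypothetical, and the way the remainder is spread over the folds when the number of rows is not divisible by k is an assumption.

```python
import random


def kfold_indices(num_rows, k=10, seed=None):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if k > num_rows:
        raise ValueError("k cannot be greater than the number of rows")
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    # Assumption: the remainder goes to the first folds, so that fold
    # sizes differ by at most one row.
    base, extra = divmod(num_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in kfold_indices(20, k=10, seed=0):
    # 20 rows in 10 folds: 18 training rows and 2 test rows per fold.
    assert len(train) == 18 and len(test) == 2
    assert set(train) | set(test) == set(range(20))
```

Every row index appears in exactly one test set, which is the property the indices() example above checks fold by fold.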
- loc(self: pybnesian.CrossValidation, columns: str or int or List[str] or List[int]) CrossValidation
Selects columns from the CrossValidation object.
- Parameters:
columns – Columns to select. The columns can be represented by their index (int or List[int]) or by their name (str or List[str]).
- Returns:
A CrossValidation object with the selected columns.
- class pybnesian.HoldOut
This class implements holdout validation, i.e. it splits the data into training and test sets.
- __init__(self: pybnesian.HoldOut, df: DataFrame, test_ratio: float = 0.2, seed: int | None = None, include_null: bool = False) None
This constructor takes a DataFrame and returns a split into training and test sets. It shuffles the data before applying the holdout.
- Parameters:
df – A DataFrame.
test_ratio – Proportion of instances left for the test data.
seed – A random seed number. If not specified or None, a random seed is generated.
include_null – Whether to include the rows where some columns may be null (missing). If false, the rows with some missing values are filtered before performing the holdout split. Else, all the rows are included.
- test_data(self: pybnesian.HoldOut) DataFrame
Gets the test data.
- Returns:
Test data.
- training_data(self: pybnesian.HoldOut) DataFrame
Gets the training data.
- Returns:
Training data.
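As a rough illustration of the holdout technique, the following plain-Python sketch shuffles the row indices and reserves a test_ratio share of them for testing. It is not PyBNesian's code: the function name holdout_indices is hypothetical, and rounding the test size to the nearest integer is an assumption.

```python
import random


def holdout_indices(num_rows, test_ratio=0.2, seed=None):
    """Return (train_indices, test_indices) after shuffling the rows."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    # Assumption: the test size is rounded to the nearest integer.
    num_test = round(num_rows * test_ratio)
    return indices[num_test:], indices[:num_test]


train, test = holdout_indices(20, test_ratio=0.2, seed=0)
assert len(train) == 16 and len(test) == 4
assert set(train).isdisjoint(test)
```

Passing the same seed reproduces the same split, which mirrors the purpose of the seed parameter described above.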
Dynamic Data
- class pybnesian.DynamicDataFrame
This class implements the adaptation of a DataFrame to a dynamic context (time series). This makes it easier to learn dynamic Bayesian networks.
A DynamicDataFrame creates columns with different temporal delays from the data in the static DataFrame. Each column in the DynamicDataFrame is named with the following pattern: [variable_name]_t_[temporal_index]. The variable_name is the name of each column in the static DataFrame. The temporal_index is an index in the range [0, markovian_order]. The index “0” is considered the “present”, the index “1” is delayed one temporal step into the “past”, and so on.
DynamicDataFrame contains two functions, DynamicDataFrame.static_df() and DynamicDataFrame.transition_df(), that can be used to learn the static Bayesian network and transition Bayesian network components of a dynamic Bayesian network.
All the operations are implemented using a zero-copy strategy to avoid wasting memory.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.transition_df().to_pandas()
   a_t_0  a_t_1  a_t_2
0    2.0    1.0    0.0
1    3.0    2.0    1.0
2    4.0    3.0    2.0
3    5.0    4.0    3.0
4    6.0    5.0    4.0
5    7.0    6.0    5.0
6    8.0    7.0    6.0
7    9.0    8.0    7.0
>>> ddf.static_df().to_pandas()
   a_t_1  a_t_2
0    1.0    0.0
1    2.0    1.0
2    3.0    2.0
3    4.0    3.0
4    5.0    4.0
5    6.0    5.0
6    7.0    6.0
7    8.0    7.0
8    9.0    8.0
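The delayed columns can be reproduced with ordinary list slicing, which may help to see where each value comes from. The sketch below is an assumption-laden illustration, not the library's zero-copy implementation: lagged_columns and its first_slice parameter are hypothetical names, and the row counts are inferred only from the example output above.

```python
def lagged_columns(data, markovian_order, first_slice=0):
    """Build columns named [variable_name]_t_[temporal_index].

    data maps each variable name to its list of values. Temporal slices
    from first_slice to markovian_order (inclusive) are generated.
    """
    num_rows = len(next(iter(data.values())))
    # Each output row needs a value in every generated slice.
    out_rows = num_rows - markovian_order + first_slice
    columns = {}
    for index in range(first_slice, markovian_order + 1):
        offset = markovian_order - index  # slice 0 holds the latest values
        for name, values in data.items():
            columns[f"{name}_t_{index}"] = values[offset:offset + out_rows]
    return columns


data = {"a": [float(i) for i in range(10)]}
transition = lagged_columns(data, 2)             # slices 0, 1 and 2
static = lagged_columns(data, 2, first_slice=1)  # slices 1 and 2 only

assert list(transition) == ["a_t_0", "a_t_1", "a_t_2"]
assert transition["a_t_0"][0] == 2.0 and transition["a_t_2"][0] == 0.0
assert len(transition["a_t_0"]) == 8 and len(static["a_t_1"]) == 9
```

With a Markovian order of 2 and 10 input rows, this yields the same 8 transition rows and 9 static rows shown in the example.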
- __init__(self: pybnesian.DynamicDataFrame, df: DataFrame, markovian_order: int) None
Creates a DynamicDataFrame from a static DataFrame using a given markovian order.
- Parameters:
df – A DataFrame.
markovian_order – Markovian order of the transformation.
- loc(self: pybnesian.DynamicDataFrame, columns: DynamicVariable or List[DynamicVariable]) DataFrame
Gets a column or set of columns from the DynamicDataFrame. See DynamicVariable.
- Returns:
A DataFrame with the selected columns.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float),
...                    'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.loc(("b", 1)).to_pandas()
   b_t_1
0   10.0
1   20.0
2   30.0
3   40.0
4   50.0
5   60.0
6   70.0
7   80.0
>>> ddf.loc([("a", 0), ("b", 1)]).to_pandas()
   a_t_0  b_t_1
0    2.0   10.0
1    3.0   20.0
2    4.0   30.0
3    5.0   40.0
4    6.0   50.0
5    7.0   60.0
6    8.0   70.0
7    9.0   80.0
All the DynamicVariables in the list must be of the same type, so do not mix different types:
>>> ddf.loc([(0, 0), ("b", 1)]) # do NOT do this!
>>> # Either you use names or indices:
>>> ddf.loc([("a", 0), ("b", 1)]) # GOOD
>>> ddf.loc([(0, 1), (1, 1)]) # GOOD
- markovian_order(self: pybnesian.DynamicDataFrame) int
Gets the markovian order.
- Returns:
Markovian order of the DynamicDataFrame.
- num_columns(self: pybnesian.DynamicDataFrame) int
Gets the number of columns.
- Returns:
The number of columns. This is equal to the number of columns of DynamicDataFrame.transition_df().
- num_rows(self: pybnesian.DynamicDataFrame) int
Gets the number of rows.
- Returns:
Number of rows.
- num_variables(self: pybnesian.DynamicDataFrame) int
Gets the number of variables.
- Returns:
The number of variables. This is exactly equal to the number of columns in DynamicDataFrame.origin_df().
- origin_df(self: pybnesian.DynamicDataFrame) DataFrame
Gets the original DataFrame.
- Returns:
The DataFrame passed to the constructor of DynamicDataFrame.
- static_df(self: pybnesian.DynamicDataFrame) DataFrame
Gets the DataFrame for the static Bayesian network. The static network estimates the probability f(t_1, …, t_[markovian_order]). See the DynamicDataFrame example.
- Returns:
A DataFrame with columns from [variable_name]_t_1 to [variable_name]_t_[markovian_order].
- temporal_slice(self: pybnesian.DynamicDataFrame, indices: int or List[int]) DataFrame
Gets a temporal slice or a set of temporal slices. The i-th temporal slice is composed of the columns [variable_name]_t_i.
- Returns:
A DataFrame with the selected temporal slices.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float), 'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.temporal_slice(1).to_pandas()
   a_t_1  b_t_1
0    1.0   10.0
1    2.0   20.0
2    3.0   30.0
3    4.0   40.0
4    5.0   50.0
5    6.0   60.0
6    7.0   70.0
7    8.0   80.0
>>> ddf.temporal_slice([0, 2]).to_pandas()
   a_t_0  b_t_0  a_t_2  b_t_2
0    2.0   20.0    0.0    0.0
1    3.0   30.0    1.0   10.0
2    4.0   40.0    2.0   20.0
3    5.0   50.0    3.0   30.0
4    6.0   60.0    4.0   40.0
5    7.0   70.0    5.0   50.0
6    8.0   80.0    6.0   60.0
7    9.0   90.0    7.0   70.0
- transition_df(self: pybnesian.DynamicDataFrame) DataFrame
Gets the DataFrame for the transition Bayesian network. The transition network estimates the conditional probability f(t_0 | t_1, …, t_[markovian_order]). See the DynamicDataFrame example.
- Returns:
A DataFrame with columns from [variable_name]_t_0 to [variable_name]_t_[markovian_order].
- class DynamicVariable
A DynamicVariable is the representation of a column in a DynamicDataFrame.
A DynamicVariable is a tuple (variable_index, temporal_index). variable_index is a str or int that represents the name or index of the variable in the original static DataFrame. temporal_index is an int that represents the temporal slice in the DynamicDataFrame. See DynamicDataFrame.loc for usage examples.
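To make the tuple convention concrete, the following plain-Python sketch maps a (variable_index, temporal_index) tuple to the column name it refers to. The helpers column_name and column_names are hypothetical and are not part of PyBNesian. Raising TypeError for a list that mixes names and indices is an assumption made for illustration; the exception the library raises may differ.

```python
def column_name(variable, variable_names):
    """Map a (variable_index, temporal_index) tuple to its column name."""
    variable_index, temporal_index = variable
    if isinstance(variable_index, int):
        # An int refers to the position of the variable in the static data.
        variable_index = variable_names[variable_index]
    return f"{variable_index}_t_{temporal_index}"


def column_names(variables, variable_names):
    """Resolve a list of tuples, rejecting a mix of names and indices."""
    kinds = {type(variable[0]) for variable in variables}
    if len(kinds) > 1:
        # Assumption: mixing str and int identifiers is treated as an error.
        raise TypeError("use either names or indices, not both")
    return [column_name(variable, variable_names) for variable in variables]


names = ["a", "b"]  # column names of the static data, in order
assert column_name(("b", 1), names) == "b_t_1"
assert column_names([("a", 0), ("b", 1)], names) == ["a_t_0", "b_t_1"]
assert column_names([(0, 1), (1, 1)], names) == ["a_t_1", "b_t_1"]
```

The resolved names follow the [variable_name]_t_[temporal_index] pattern used by the columns of a DynamicDataFrame.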