Data Manipulation
The pybnesian.dataset module implements useful dataset manipulation techniques such as k-fold cross-validation and holdout.
DataFrame
Internally, PyBNesian uses a pyarrow.RecordBatch to enable zero-copy data exchange between C++ and Python.
Most classes and methods take a DataFrame type as an argument or return one. This type is an encapsulation of pyarrow.RecordBatch:
- When a function takes a DataFrame as an argument, either a pyarrow.RecordBatch or a pandas.DataFrame can be passed.
- When PyBNesian specifies a DataFrame return type, a pyarrow.RecordBatch is returned. It can be converted easily to a pandas.DataFrame using pyarrow.RecordBatch.to_pandas().
DataFrame Operations
- class pybnesian.dataset.CrossValidation
This class implements k-fold cross-validation, i.e. it splits the data into k folds, each consisting of disjoint training and test data.
- __init__(self: pybnesian.dataset.CrossValidation, df: DataFrame, k: int = 10, seed: Optional[int] = None, include_null: bool = False) → None
This constructor takes a DataFrame and creates a k-fold cross-validation. It shuffles the data before applying the cross-validation.
- Parameters
df – A DataFrame.
k – Number of folds.
seed – A random seed number. If not specified or None, a random seed is generated.
include_null – Whether to include the rows where some columns may be null (missing). If False, rows with missing values are filtered out before performing the cross-validation. Otherwise, all rows are included.
- Raises
ValueError – If k is greater than the number of rows.
- __iter__(self: pybnesian.dataset.CrossValidation) → Iterator
Iterates over the k-fold cross-validation.
- Returns
The iterator yields a tuple (DataFrame, DataFrame) containing the training data and test data of each fold.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_data, test_data) in CrossValidation(df):
...     assert training_data.num_rows == 18
...     assert test_data.num_rows == 2
- fold(self: pybnesian.dataset.CrossValidation, index: int) → Tuple[DataFrame, DataFrame]
Returns the index-th fold.
- Parameters
index – Fold index.
- Returns
A tuple (DataFrame, DataFrame) containing the training data and test data of the index-th fold.
- indices(self: pybnesian.dataset.CrossValidation) → Iterator
Iterates over the row indices of each training and test DataFrame.
- Returns
A tuple (list, list) containing the row indices (with respect to the original DataFrame) of the training and test data of each fold.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import CrossValidation
>>> df = pd.DataFrame({'a': np.random.rand(20), 'b': np.random.rand(20)})
>>> for (training_indices, test_indices) in CrossValidation(df).indices():
...     assert set(range(20)) == set(list(training_indices) + list(test_indices))
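The fold structure described above can be illustrated without PyBNesian. The following is a minimal pure-Python sketch, not the library's implementation: the helper name kfold_indices and the way leftover rows are distributed among folds are assumptions made for illustration only.

```python
import random
from typing import Iterator, List, Optional, Tuple


def kfold_indices(num_rows: int, k: int = 10,
                  seed: Optional[int] = None) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training_indices, test_indices) for each of the k folds.

    Hypothetical helper: it mirrors the documented behaviour (shuffle first,
    ValueError if k > num_rows) but is not PyBNesian's actual code.
    """
    if k > num_rows:
        raise ValueError("k cannot be greater than the number of rows.")

    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # shuffle before splitting

    # Assumption: the first (num_rows % k) folds receive one extra row.
    base, extra = divmod(num_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(20, k=10, seed=0))
for train, test in folds:
    # Within a fold, training and test indices together cover every row once.
    assert sorted(train + test) == list(range(20))
    assert len(train) == 18 and len(test) == 2
```

With 20 rows and the default k = 10, each fold has 18 training rows and 2 test rows, which matches the sizes asserted in the CrossValidation example.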
- loc(self: pybnesian.dataset.CrossValidation, columns: str or int or List[str] or List[int]) → CrossValidation
Selects columns from the CrossValidation object.
- Parameters
columns – Columns to select. The columns can be represented by their index (int or List[int]) or by their name (str or List[str]).
- Returns
A CrossValidation object with the selected columns.
- class pybnesian.dataset.HoldOut
This class implements holdout validation, i.e. it splits the data into training and test sets.
- __init__(self: pybnesian.dataset.HoldOut, df: DataFrame, test_ratio: float = 0.2, seed: Optional[int] = None, include_null: bool = False) → None
This constructor takes a DataFrame and creates a split into training and test sets. It shuffles the data before applying the holdout.
- Parameters
df – A DataFrame.
test_ratio – Proportion of instances left for the test data.
seed – A random seed number. If not specified or None, a random seed is generated.
include_null – Whether to include the rows where some columns may be null (missing). If False, rows with missing values are filtered out before performing the holdout split. Otherwise, all rows are included.
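As a rough illustration of what a shuffled holdout split does, here is a minimal pure-Python sketch. It is not PyBNesian's implementation: the helper name holdout_indices, the argument check, and the choice to round the test set size down are all assumptions.

```python
import random
from typing import List, Optional, Tuple


def holdout_indices(num_rows: int, test_ratio: float = 0.2,
                    seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Return (training_indices, test_indices) for a shuffled holdout split.

    Hypothetical helper, not PyBNesian's code. Assumption: the test set size
    is num_rows * test_ratio rounded down.
    """
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be in the open interval (0, 1).")

    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)  # shuffle before splitting

    num_test = int(num_rows * test_ratio)
    # The first num_test shuffled rows form the test set, the rest the training set.
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = holdout_indices(20, test_ratio=0.2, seed=0)
assert len(train_idx) == 16 and len(test_idx) == 4
assert set(train_idx).isdisjoint(test_idx)
```

Passing the same seed reproduces the same split, which is the purpose of the seed parameter documented above.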
- test_data(self: pybnesian.dataset.HoldOut) → DataFrame
Gets the test data.
- Returns
Test data.
- training_data(self: pybnesian.dataset.HoldOut) → DataFrame
Gets the training data.
- Returns
Training data.
Dynamic Data
- class pybnesian.dataset.DynamicDataFrame
This class implements the adaptation of a static DataFrame to a dynamic context (temporal series). This makes it easier to learn dynamic Bayesian networks.
A DynamicDataFrame creates columns with different temporal delays from the data in the static DataFrame. Each column in the DynamicDataFrame is named with the following pattern: [variable_name]_t_[temporal_index]. The variable_name is the name of each column in the static DataFrame. The temporal_index is an index in the range [0, markovian_order]. The index “0” is considered the “present”, the index “1” refers to one step into the “past”, and so on.
DynamicDataFrame contains two methods, DynamicDataFrame.static_df() and DynamicDataFrame.transition_df(), that can be used to learn the static Bayesian network and transition Bayesian network components of a dynamic Bayesian network.
All the operations are implemented using a zero-copy strategy to avoid wasting memory.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.transition_df().to_pandas()
   a_t_0  a_t_1  a_t_2
0    2.0    1.0    0.0
1    3.0    2.0    1.0
2    4.0    3.0    2.0
3    5.0    4.0    3.0
4    6.0    5.0    4.0
5    7.0    6.0    5.0
6    8.0    7.0    6.0
7    9.0    8.0    7.0
>>> ddf.static_df().to_pandas()
   a_t_1  a_t_2
0    1.0    0.0
1    2.0    1.0
2    3.0    2.0
3    4.0    3.0
4    5.0    4.0
5    6.0    5.0
6    7.0    6.0
7    8.0    7.0
8    9.0    8.0
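The column layout in the example above can be reproduced with plain list slicing. The sketch below is only an illustration of the naming pattern and the row alignment: lagged_columns and its first_slice parameter are hypothetical, and the code copies data instead of using PyBNesian's zero-copy strategy.

```python
from typing import Dict, List, Sequence


def lagged_columns(data: Dict[str, Sequence[float]], markovian_order: int,
                   first_slice: int) -> Dict[str, List[float]]:
    """Build columns named [variable_name]_t_[temporal_index].

    first_slice=0 reproduces the layout of the transition_df() example,
    first_slice=1 reproduces the layout of the static_df() example.
    Hypothetical helper; it copies the data, unlike PyBNesian.
    """
    num_rows = len(next(iter(data.values())))
    # Each output row needs (markovian_order - first_slice) earlier rows.
    out_rows = num_rows - markovian_order + first_slice
    result: Dict[str, List[float]] = {}
    for t in range(first_slice, markovian_order + 1):
        for name, values in data.items():
            # A larger temporal index starts earlier in the original series.
            start = markovian_order - t
            result[f"{name}_t_{t}"] = list(values[start:start + out_rows])
    return result


data = {"a": [float(i) for i in range(10)]}
transition = lagged_columns(data, markovian_order=2, first_slice=0)
static = lagged_columns(data, markovian_order=2, first_slice=1)

assert list(transition) == ["a_t_0", "a_t_1", "a_t_2"]
assert transition["a_t_0"] == [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
assert list(static) == ["a_t_1", "a_t_2"]
assert static["a_t_1"] == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```

The transition layout loses markovian_order rows because every row needs a full history, while the static layout loses one row fewer because it omits the slice at index 0.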
- __init__(self: pybnesian.dataset.DynamicDataFrame, df: DataFrame, markovian_order: int) → None
Creates a DynamicDataFrame from a static DataFrame using a given markovian order.
- Parameters
df – A DataFrame.
markovian_order – Markovian order of the transformation.
- loc(self: pybnesian.dataset.DynamicDataFrame, columns: DynamicVariable or List[DynamicVariable]) → DataFrame
Gets a column or set of columns from the DynamicDataFrame. See DynamicVariable.
- Returns
A DataFrame with the selected columns.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float),
...                    'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.loc(("b", 1)).to_pandas()
   b_t_1
0   10.0
1   20.0
2   30.0
3   40.0
4   50.0
5   60.0
6   70.0
7   80.0
>>> ddf.loc([("a", 0), ("b", 1)]).to_pandas()
   a_t_0  b_t_1
0    2.0   10.0
1    3.0   20.0
2    4.0   30.0
3    5.0   40.0
4    6.0   50.0
5    7.0   60.0
6    8.0   70.0
7    9.0   80.0
All the DynamicVariables in the list must be of the same type, so do not mix names and indices:
>>> ddf.loc([(0, 0), ("b", 1)]) # do NOT do this!
>>> # Use either names or indices:
>>> ddf.loc([("a", 0), ("b", 1)]) # GOOD
>>> ddf.loc([(0, 1), (1, 1)]) # GOOD
- markovian_order(self: pybnesian.dataset.DynamicDataFrame) → int
Gets the markovian order.
- Returns
Markovian order of the DynamicDataFrame.
- num_columns(self: pybnesian.dataset.DynamicDataFrame) → int
Gets the number of columns.
- Returns
The number of columns. This is equal to the number of columns of DynamicDataFrame.transition_df().
- num_rows(self: pybnesian.dataset.DynamicDataFrame) → int
Gets the number of rows.
- Returns
Number of rows.
- num_variables(self: pybnesian.dataset.DynamicDataFrame) → int
Gets the number of variables.
- Returns
The number of variables. This is exactly equal to the number of columns in DynamicDataFrame.origin_df().
- origin_df(self: pybnesian.dataset.DynamicDataFrame) → DataFrame
Gets the original DataFrame.
- Returns
The DataFrame passed to the constructor of DynamicDataFrame.
- static_df(self: pybnesian.dataset.DynamicDataFrame) → DataFrame
Gets the DataFrame for the static Bayesian network. The static network estimates the probability f(t_1, …, t_[markovian_order]). See the DynamicDataFrame example.
- Returns
A DataFrame with columns from [variable_name]_t_1 to [variable_name]_t_[markovian_order].
- temporal_slice(self: pybnesian.dataset.DynamicDataFrame, indices: int or List[int]) → DataFrame
Gets a temporal slice or a set of temporal slices. The i-th temporal slice is composed of the columns [variable_name]_t_i.
- Returns
A DataFrame with the selected temporal slices.
>>> import numpy as np
>>> import pandas as pd
>>> from pybnesian.dataset import DynamicDataFrame
>>> df = pd.DataFrame({'a': np.arange(10, dtype=float),
...                    'b': np.arange(0, 100, 10, dtype=float)})
>>> ddf = DynamicDataFrame(df, 2)
>>> ddf.temporal_slice(1).to_pandas()
   a_t_1  b_t_1
0    1.0   10.0
1    2.0   20.0
2    3.0   30.0
3    4.0   40.0
4    5.0   50.0
5    6.0   60.0
6    7.0   70.0
7    8.0   80.0
>>> ddf.temporal_slice([0, 2]).to_pandas()
   a_t_0  b_t_0  a_t_2  b_t_2
0    2.0   20.0    0.0    0.0
1    3.0   30.0    1.0   10.0
2    4.0   40.0    2.0   20.0
3    5.0   50.0    3.0   30.0
4    6.0   60.0    4.0   40.0
5    7.0   70.0    5.0   50.0
6    8.0   80.0    6.0   60.0
7    9.0   90.0    7.0   70.0
- transition_df(self: pybnesian.dataset.DynamicDataFrame) → DataFrame
Gets the DataFrame for the transition Bayesian network. The transition network estimates the conditional probability f(t_0 | t_1, …, t_[markovian_order]). See the DynamicDataFrame example.
- Returns
A DataFrame with columns from [variable_name]_t_0 to [variable_name]_t_[markovian_order].
- class pybnesian.dataset.DynamicVariable
A DynamicVariable is the representation of a column in a DynamicDataFrame.
A DynamicVariable is a tuple (variable_index, temporal_index). variable_index is a str or int that represents the name or index of the variable in the original static DataFrame. temporal_index is an int that represents the temporal slice in the DynamicDataFrame. See DynamicDataFrame.loc() for usage examples.
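To make the tuple convention concrete, the following pure-Python sketch resolves DynamicVariable tuples to column names of the form [variable_name]_t_[temporal_index]. The helpers column_name and column_names are hypothetical, and raising TypeError for mixed names and indices is an assumption; the exception PyBNesian actually raises is not stated here.

```python
from typing import List, Sequence, Tuple, Union

# (variable_index, temporal_index), as described above.
DynamicVariable = Tuple[Union[str, int], int]


def column_name(variable: DynamicVariable, static_columns: Sequence[str]) -> str:
    """Map a DynamicVariable to the name [variable_name]_t_[temporal_index]."""
    variable_index, temporal_index = variable
    if isinstance(variable_index, int):
        # An int refers to the column position in the static DataFrame.
        variable_index = static_columns[variable_index]
    return f"{variable_index}_t_{temporal_index}"


def column_names(variables: List[DynamicVariable],
                 static_columns: Sequence[str]) -> List[str]:
    """Resolve a list of DynamicVariables, rejecting mixed names and indices."""
    kinds = {type(variable[0]) for variable in variables}
    if len(kinds) > 1:
        # Assumption: the error type is chosen for this sketch only.
        raise TypeError("Use either names or indices, not both.")
    return [column_name(variable, static_columns) for variable in variables]


static_columns = ["a", "b"]
by_name = column_names([("a", 0), ("b", 1)], static_columns)
by_index = column_names([(0, 1), (1, 1)], static_columns)
assert by_name == ["a_t_0", "b_t_1"]
assert by_index == ["a_t_1", "b_t_1"]
```

Checking the type of the first tuple element up front keeps a list such as [(0, 0), ("b", 1)] from being silently accepted.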