src.clustering

Classes

BaseClustering

Base class for clustering models used to group data based on embeddings or other features.

Functions

get_clustering_model(method_name, *args, **kwargs)

Package Contents

class src.clustering.BaseClustering(dataset_name: str = None, embedding_method: str = None, dataset_id: int = None, embedding_id: int = None, embeddings: pandas.DataFrame = None)

Base class for clustering models used to group data based on embeddings or other features.

This class provides core functionality for clustering models, including loading data and storing clustering results. It is meant to be subclassed by specific clustering algorithms, which should implement their own logic for fitting the model and predicting clusters.

Attributes:

data (pd.DataFrame or None): DataFrame containing the data to be clustered.
labels (pd.Series or None): Series containing the cluster labels assigned to the data.

Methods:

load_data(file_path: str) -> pd.DataFrame: Loads a dataset from a CSV or pickle file into a pandas DataFrame.
save_labels(file_path: str): Saves the cluster labels to a CSV or pickle file.
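
The subclassing pattern described above can be sketched as follows. The real `src.clustering.BaseClustering` is not reproduced here, so the snippet defines a simplified stand-in (`BaseClusteringStandIn`, a hypothetical name) that only mirrors the documented constructor arguments and attributes. `ThresholdClustering`, its mean-threshold rule, and the `cluster` column name are likewise assumptions chosen for illustration and are not part of the package.

```python
import pandas as pd


class BaseClusteringStandIn:
    """Simplified stand-in for src.clustering.BaseClustering (assumption)."""

    def __init__(self, dataset_name=None, embedding_method=None,
                 dataset_id=None, embedding_id=None, embeddings=None):
        self.dataset_name = dataset_name
        self.embedding_method = embedding_method
        self.dataset_id = dataset_id
        self.embedding_id = embedding_id
        self.embeddings = embeddings
        self.data = None
        self.labels = None

    def fit_predict(self):
        # The stand-in leaves the algorithm to subclasses.
        raise NotImplementedError("Subclasses implement the clustering logic.")


class ThresholdClustering(BaseClusteringStandIn):
    """Hypothetical subclass: two clusters split at the overall mean."""

    def fit_predict(self):
        row_means = self.embeddings.mean(axis=1)
        overall_mean = row_means.mean()
        # Rows whose mean exceeds the overall mean get label 1, others 0.
        self.labels = (row_means > overall_mean).astype(int)
        return self.labels.to_frame(name="cluster")


embeddings = pd.DataFrame({"x": [0.0, 0.1, 5.0, 5.2], "y": [0.2, 0.0, 4.9, 5.1]})
model = ThresholdClustering(dataset_name="toy", embeddings=embeddings)
result = model.fit_predict()
```

Here `result` is a one-column DataFrame of labels, and the same labels are kept on `model.labels` as a Series, matching the attribute types listed above.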

dataset_name

embedding_method

dataset_id

embedding_id

embeddings

data = None

labels = None

load_data() → pandas.DataFrame

Load the data to be clustered from a CSV or pickle file.

Parameters: file_path – Path to the data file (CSV or pickle format).
Returns: DataFrame containing the loaded data.
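
One plausible way to load either format is to dispatch on the file extension. The sketch below is not the package's implementation: `load_table` is a hypothetical standalone helper, and the accepted extensions (`.csv`, `.pkl`, `.pickle`) are assumptions.

```python
import os
import tempfile

import pandas as pd


def load_table(file_path: str) -> pd.DataFrame:
    """Load a DataFrame from a CSV or pickle file (hypothetical helper)."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".csv":
        return pd.read_csv(file_path)
    if extension in (".pkl", ".pickle"):
        return pd.read_pickle(file_path)
    raise ValueError(f"Unsupported file extension: {extension!r}")


# Round-trip a small frame through a temporary CSV file.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "points.csv")
    pd.DataFrame({"x": [1, 2], "y": [3, 4]}).to_csv(csv_path, index=False)
    loaded = load_table(csv_path)
```

Raising `ValueError` for unknown extensions is a design choice of this sketch; it surfaces a wrong path early instead of guessing a format.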

scale_data(data)

save_labels(labels)

Save the cluster labels to a CSV or pickle file.

Parameters: file_path – Path where the cluster labels will be saved.
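
Saving can follow the same extension-based dispatch. This is a sketch under assumptions, not the documented method: `save_series` is a hypothetical standalone helper, and keeping the index in the CSV output (so labels can be matched back to rows) is a choice made here for illustration.

```python
import os
import tempfile

import pandas as pd


def save_series(labels: pd.Series, file_path: str) -> None:
    """Write cluster labels to a CSV or pickle file (hypothetical helper)."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension == ".csv":
        # Keep the index so each label can be matched back to its row.
        labels.to_csv(file_path, header=True)
    elif extension in (".pkl", ".pickle"):
        labels.to_pickle(file_path)
    else:
        raise ValueError(f"Unsupported file extension: {extension!r}")


labels = pd.Series([0, 1, 1, 0], name="cluster")
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "labels.pkl")
    save_series(labels, pkl_path)
    restored = pd.read_pickle(pkl_path)
```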

fit_predict()

Parameters: self.embeddings – DataFrame containing the data to be clustered.
Returns: DataFrame containing the cluster labels assigned to the data.

src.clustering.get_clustering_model(method_name: str, *args, **kwargs)
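
The source gives only the signature of `get_clustering_model`, so its behavior is not documented here. The name and arguments suggest a factory that looks up a clustering class by `method_name` and forwards `*args` and `**kwargs` to its constructor. The following sketch shows one common way to build such a factory; the registry, the `register` decorator, `get_model`, and `DummyClustering` are all hypothetical names and do not come from the package.

```python
from typing import Callable, Dict

# Maps a lowercase method name to the class that implements it.
_REGISTRY: Dict[str, Callable] = {}


def register(method_name: str):
    """Class decorator that records a model class under a name."""
    def decorator(cls):
        _REGISTRY[method_name.lower()] = cls
        return cls
    return decorator


@register("dummy")
class DummyClustering:
    """Placeholder model used only to exercise the factory."""

    def __init__(self, n_clusters: int = 2):
        self.n_clusters = n_clusters


def get_model(method_name: str, *args, **kwargs):
    """Instantiate the registered class, forwarding all other arguments."""
    try:
        model_cls = _REGISTRY[method_name.lower()]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(
            f"Unknown method {method_name!r}; registered methods: {known}"
        ) from None
    return model_cls(*args, **kwargs)


model = get_model("dummy", n_clusters=3)
```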