We are pleased to announce the initial release of DenseClus, an open-source clustering tool designed for high-dimensional, mixed-type data. DenseClus uses uniform manifold approximation and projection (UMAP) and hierarchical density-based clustering (HDBSCAN) to cluster datasets that contain both categorical and numerical features. Users provide a dataframe and get back clusters without extensive preprocessing and without having to decide how to handle the categorical features. Possible applications range from customer segmentation in marketing to cell mapping in biomedicine.
All software under the DenseClus project is available under the MIT license. We encourage you to explore the DenseClus code on GitHub and become a part of our growing community.
What is DenseClus?
Clustering is a hard problem because there are no labels to guide the process, and no single algorithm works well across all datasets. As Christian Hennig argues in his article “What Are True Clusters?”, clustering is highly contextual and depends on the researcher’s decisions. Traditional algorithms such as KMeans typically assume numerical data and roughly spherical clusters, which makes them a poor fit for mixed-type, high-dimensional data. Classical dimensionality reduction techniques, such as principal component analysis (PCA), also struggle when categorical data is involved, leaving practitioners to experiment with various featurization strategies.
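To make the featurization problem concrete, the sketch below compares two common ways of turning a categorical column into numbers. The column values and helper names are illustrative assumptions and are not part of DenseClus. Integer codes impose an ordering that the categories do not have, so Euclidean distance, which KMeans relies on, treats some pairs as closer than others; one-hot vectors keep all distinct categories equally far apart.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical categorical column with no natural ordering.
categories = ["red", "green", "blue"]

# Strategy 1: integer codes (red=0, green=1, blue=2).
integer_code = {c: [float(i)] for i, c in enumerate(categories)}

# Strategy 2: one-hot vectors (one position per category).
one_hot = {
    c: [1.0 if i == j else 0.0 for j in range(len(categories))]
    for i, c in enumerate(categories)
}

for name, encoding in [("integer", integer_code), ("one-hot", one_hot)]:
    d_rg = euclidean(encoding["red"], encoding["green"])
    d_rb = euclidean(encoding["red"], encoding["blue"])
    print(f"{name}: red-green={d_rg:.3f}, red-blue={d_rb:.3f}")
```

With integer codes, red ends up twice as far from blue as from green, even though the categories have no order. With one-hot vectors, both distances are equal.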
DenseClus aims to address these challenges by providing a robust default clustering algorithm that efficiently processes mixed-type data. By integrating UMAP and HDBSCAN, DenseClus maps mixed-type datasets into a dense, lower-dimensional space, enabling hierarchical group formation based on point density. This user-friendly solution is adaptable to a wide array of data types, facilitating the discovery of meaningful clusters.
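The idea of forming groups from point density can be illustrated with a much simpler stand-in. The sketch below is a toy DBSCAN-style routine in pure Python. It is not HDBSCAN and not DenseClus's implementation; the function name, the eps and min_pts parameters, and the sample points are all assumptions made for illustration. Points with enough close neighbors seed a cluster, and points in sparse regions are labeled as noise (-1).

```python
import math


def density_cluster(points, eps, min_pts):
    """Toy DBSCAN-style clustering. Returns one label per point; -1 means noise."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        px, py = points[i]
        return [
            j
            for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps
        ]

    labels = [None] * len(points)  # None = not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


# Two tight groups of three points each, plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = density_cluster(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

HDBSCAN goes further than this toy: rather than relying on a single fixed radius, it builds a hierarchy of clusters across varying density levels.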
Getting Started with DenseClus
DenseClus is now available on PyPI, and the source code is on GitHub. To install the package with pip under Python 3.7 or 3.8, run:
python3.8 -m pip install Amazon-DenseClus
DenseClus takes a pandas dataframe containing both numerical and categorical columns as input. The tool handles all preprocessing automatically; just call fit and then retrieve the clusters:
from denseclus import DenseClus
clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
clf.fit(df)
print(clf.score())
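For readers who want to see what a suitable input might look like, here is a minimal sketch of a mixed-type dataframe. The column names and values are invented for illustration, and only pandas is used; the DenseClus call itself is left as a comment so the snippet runs without the package installed.

```python
import pandas as pd

# Hypothetical customer data: two numerical and two categorical columns.
df = pd.DataFrame(
    {
        "age": [23, 45, 31, 52, 38, 27],
        "annual_spend": [1200.5, 5400.0, 2300.75, 8100.2, 3100.0, 950.0],
        "region": ["north", "south", "south", "west", "north", "west"],
        "plan": ["basic", "premium", "basic", "premium", "premium", "basic"],
    }
)

# Split columns by dtype to confirm the frame really is mixed-type.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical_cols)
print("categorical:", categorical_cols)

# With DenseClus installed, this frame would be passed to fit as in the text:
# clf = DenseClus(umap_combine_method="intersection_union_mapper")
# clf.fit(df)
```

A real run would need far more rows than this toy frame; six rows are only enough to show the expected shape of the input.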
Try it Out
We are excited about the alpha release of DenseClus. For a full walkthrough, see the DenseClus Example NB.ipynb notebook in our GitHub repository. We welcome you to experiment with the tool, share feedback, report issues, and contribute through pull requests.