Distributed Synthetic Minority Oversampling Technique
- DOI
- 10.2991/ijcis.d.190719.001
- Keywords
- SMOTE; Apache Spark; prediction; machine learning; imbalanced classification
- Abstract
Real-world prediction problems often involve predicting rare occurrences. Because of this data imbalance, standard classification algorithms are biased against these rare events. Typical approaches to the imbalance involve oversampling the “rare events” or undersampling the majority events. The Synthetic Minority Oversampling Technique (SMOTE) is one technique that addresses class imbalance effectively. However, existing implementations of SMOTE fail when the data grows and can no longer be stored on a single machine. In this paper, we present our solution to this “big data challenge.” We provide a distributed version of SMOTE using scalable k-means++ and M-Trees. With this implementation of SMOTE, we were able to oversample the “rare events” and achieve results that are better than those of the existing Python version of SMOTE.
- Copyright
- © 2019 The Authors. Published by Atlantis Press SARL.
- Open Access
- This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
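For readers unfamiliar with the oversampling step the abstract refers to, the following is a minimal sketch of classic, single-machine SMOTE interpolation. It is not the distributed method of the paper: it uses a brute-force neighbour search in place of scalable k-means++ and M-Trees, and the function name `smote_sketch` and its parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative single-machine SMOTE (not the paper's distributed version).

    Each synthetic point lies on the segment between a minority sample
    and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Brute-force pairwise squared distances; this is the step that
    # stops scaling once the data no longer fits on one machine.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]

    # Pick a random base sample and a random neighbour for each new point.
    base = rng.integers(0, n, size=n_synthetic)
    pick = neighbours[base, rng.integers(0, k, size=n_synthetic)]

    # Interpolate with a random gap in [0, 1).
    gap = rng.random((n_synthetic, 1))
    return X[base] + gap * (X[pick] - X[base])

# Usage: oversample a toy minority class of four 2-D points.
rare = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sketch(rare, n_synthetic=10, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class.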
Cite this article
TY  - JOUR
AU  - Sakshi Hooda
AU  - Suman Mann
PY  - 2019
DA  - 2019/07/30
TI  - Distributed Synthetic Minority Oversampling Technique
JO  - International Journal of Computational Intelligence Systems
SP  - 929
EP  - 936
VL  - 12
IS  - 2
SN  - 1875-6883
UR  - https://doi.org/10.2991/ijcis.d.190719.001
DO  - 10.2991/ijcis.d.190719.001
ID  - Hooda2019
ER  -