The Beginning of Our Data Science Platform Development Journey

When we started developing the Data Science Platform to help data engineers and citizen and expert data scientists manage the end-to-end data science model lifecycle at the enterprise level, the key decision was choosing the right compute framework. The fundamental need in AI/ML model development is to personalize decisions at the individual level instead of the micro/macro segment level. As such, AI/ML models require large iterative compute processes during model training and development. In-memory computation greatly expedites the training process, yielding up to a 100x improvement in turnaround time compared to classical disk I/O.

Embracing and Enhancing Spark

We selected open-source Spark as our compute engine and used its existing Spark ML base to build the platform. At that time, Spark had just released its 1.x version. We analyzed the gaps in that version and started working to fill them based on our own platform's requirements. As a result, we created our own derived version of the Spark ML component, without making any significant changes to the compute layer. For seamless integration with different Spark distributions (cloud-based, such as Azure and AWS, and non-cloud-based, such as Cloudera and MapR), our team enhanced the Livy module, which provides integration with Spark's REST-based API.

A Glimpse into 10 Notable Enhancements We Have Made in Spark for Our Platform

  1. Deep Learning Algorithms: Spark ML was lacking in deep learning capability. We added some commonly used deep learning algorithms, such as CNN (Convolutional Neural Network) and RBM (Restricted Boltzmann Machine), which are crucial for solving many deep learning use cases.
  2. Use of SMOTE: One of the major problems data scientists face is the need for SMOTE-style algorithms to tackle data imbalance. This happens in use cases like lending, where the data is inherently imbalanced. We extended Spark ML with SMOTE through a KNN implementation.
  3. Mechanism for Solving Timeseries Algorithms: Given that there was no support for timeseries algorithms in Spark, our team filled the gap by implementing ARIMA and ARIMAX on Spark. This gives our platform the capability to solve timeseries problems such as sales forecasting and price forecasting.
  4. Ability to Handle Missing Data Scenarios: One of the major challenges in any AI/ML modeling is handling missing data. The missing value can be numeric, like age, or a string, like address, city, or retail type. We added handling for many different missing-value scenarios, which helps data scientists solve use cases properly.
  5. Field-aware Factorization Machine (FFM): There are many scenarios where the datasets used for AI/ML modeling take the form of a sparse matrix. For example, to predict CTR (click-through rate), one needs a factorization process to make use of such a dataset. We extended Spark ML with FFM to solve these kinds of use cases.
  6. AutoML Functionality: As the platform's user base has grown, many citizen data scientists have asked for AutoML support. With AutoML, they can try different algorithms in one go rather than one by one with different parameters, saving the time of running separate individual experiments. We have added AutoML to our platform.
  7. Model Evaluation Metrics: We added model evaluation metrics missing from Spark ML across the various algorithm families, such as regression, clustering, and classification. This helps data science personnel evaluate models against multiple metrics during automatic retraining of AI/ML models.
  8. Memory Consumption Optimization: We optimized memory consumption so that it holds up as the scale of data increases and as compute requirements change with data values and operations.
  9. Enhanced Data Presentation for Recommendation Scenarios: Typically, algorithms like ALS provide only the basic capability for recommendation. We added advanced capability to inspect all products identified for recommendation and built data presentation components to control how recommendations are presented to different user segments.
  10. Addition of Vector Disassembler: We added a vector disassembler for many use cases, such as SVD, PCA, and one-hot encoding. It also helps in understanding each dimension of a vector properly through data dumps.
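To make item 2 concrete, here is a minimal pure-Python sketch of the SMOTE-through-KNN idea: each synthetic minority sample is an interpolation between a real minority point and one of its k nearest neighbours. The function name and data are illustrative, not the platform's actual Spark ML API.

```python
import math
import random

def smote(minority, k=2, n_synthetic=4, seed=42):
    """Generate synthetic minority samples by interpolating each chosen
    point toward one of its k nearest neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new_points = smote(minority)
```

Because each synthetic point lies on a segment between two real minority points, the oversampled class stays inside the region the minority already occupies.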
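The timeseries capability from item 3 boils down to fitting an autoregressive model and rolling it forward. This is a toy AR(1) fit by ordinary least squares, far simpler than full ARIMA/ARIMAX, with made-up sales numbers; it is a sketch of the idea, not the platform's implementation.

```python
def fit_ar1(series):
    """Fit x_t = c + phi * x_{t-1} by ordinary least squares."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var = sum((a - mean_x) ** 2 for a in xs)
    phi = cov / var
    c = mean_y - phi * mean_x
    return c, phi

def forecast(series, steps, c, phi):
    """Roll the fitted AR(1) model forward for the requested horizon."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = c + phi * last
        out.append(last)
    return out

sales = [10.0, 12.0, 13.0, 13.5, 13.8]
c, phi = fit_ar1(sales)
preds = forecast(sales, 3, c, phi)
```

Real ARIMA adds differencing and moving-average terms (and ARIMAX adds exogenous regressors), but the fit-then-roll-forward structure is the same.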
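For item 4, the two most common imputation strategies are column mean for numeric fields and column mode for string fields. A minimal sketch with hypothetical row data (the platform supports many more scenarios than these two):

```python
from collections import Counter

def impute(rows, numeric_cols, string_cols):
    """Fill numeric gaps with the column mean and string gaps with the mode."""
    filled = [dict(r) for r in rows]
    for col in numeric_cols:
        vals = [r[col] for r in rows if r[col] is not None]
        mean = sum(vals) / len(vals)
        for r in filled:
            if r[col] is None:
                r[col] = mean
    for col in string_cols:
        vals = [r[col] for r in rows if r[col] is not None]
        mode = Counter(vals).most_common(1)[0][0]
        for r in filled:
            if r[col] is None:
                r[col] = mode
    return filled

rows = [
    {"age": 30, "city": "Pune"},
    {"age": None, "city": "Pune"},
    {"age": 40, "city": None},
]
clean = impute(rows, ["age"], ["city"])
```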
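The FFM scoring rule from item 5 is worth spelling out: unlike a plain factorization machine, each feature keeps a separate latent vector per opposite *field*, and the pairwise score uses the field-matched pair. A tiny sketch with an invented two-feature example; the data layout is illustrative only.

```python
def ffm_score(features, latent):
    """Field-aware factorization machine pairwise interaction score.

    features: list of (field, index, value) triples for the active features
    latent:   dict mapping (feature_index, opposite_field) -> latent vector
    """
    score = 0.0
    for a in range(len(features)):
        fa, ia, xa = features[a]
        for b in range(a + 1, len(features)):
            fb, ib, xb = features[b]
            va = latent[(ia, fb)]  # vector of feature a, specific to b's field
            vb = latent[(ib, fa)]  # vector of feature b, specific to a's field
            score += sum(p * q for p, q in zip(va, vb)) * xa * xb
    return score

# Two one-hot features: feature 0 in field 0, feature 1 in field 1
features = [(0, 0, 1.0), (1, 1, 1.0)]
latent = {(0, 1): [1.0, 2.0], (1, 0): [0.5, 0.5]}
score = ffm_score(features, latent)
```

Because CTR datasets are sparse one-hot matrices, only the handful of active features per row enter this double loop, which is what makes factorization approaches practical there.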
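The AutoML workflow in item 6 is, at its core, a loop that evaluates every candidate against a held-out set and keeps the best. A deliberately tiny sketch with stand-in "models" (simple predict functions), not the platform's actual AutoML engine:

```python
def auto_ml(candidates, X_val, y_val):
    """Try every candidate model in one go; return the best by accuracy."""
    best_name, best_acc = None, -1.0
    for name, predict in candidates:
        preds = [predict(x) for x in X_val]
        acc = sum(1 for p, y in zip(preds, y_val) if p == y) / len(y_val)
        if acc > best_acc:
            best_name, best_acc = name, acc
    return best_name, best_acc

X_val = [0.2, 0.4, 0.6, 0.8]
y_val = [0, 0, 1, 1]
candidates = [
    ("threshold_0.5", lambda x: 1 if x > 0.5 else 0),
    ("always_one", lambda x: 1),
]
winner, acc = auto_ml(candidates, X_val, y_val)
```

A production AutoML loop would also sweep hyperparameters per algorithm and cross-validate, but the select-by-metric skeleton is the same.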
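Two representative metrics from item 7, one for regression and one for classification, written out in plain Python as a reference for what the added evaluators compute (these are the standard definitions, not the platform's code):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error for regression models."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def f1(y_true, y_pred):
    """F1 score (harmonic mean of precision and recall) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

During auto-retraining, scoring a candidate model against several such metrics at once guards against a retrain that improves one metric while silently degrading another.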
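Finally, the vector disassembler from item 10 is the inverse of Spark ML's VectorAssembler: it expands a vector column into one scalar column per dimension, so each dimension can be inspected in data dumps. A minimal sketch over plain dict rows (the column names are hypothetical):

```python
def disassemble(rows, vector_col, prefix):
    """Expand a vector column into one scalar column per dimension,
    the inverse of assembling features into a single vector column."""
    out = []
    for r in rows:
        flat = {k: v for k, v in r.items() if k != vector_col}
        for i, component in enumerate(r[vector_col]):
            flat[f"{prefix}_{i}"] = component
        out.append(flat)
    return out

rows = [{"id": 1, "pca": [0.3, 0.7]}, {"id": 2, "pca": [0.1, 0.9]}]
flat_rows = disassemble(rows, "pca", "pca")
```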

We understand that a product journey constantly evolves. To stay ahead of the curve, we keep adding relevant components to Spark with every release. Here are some of the roadmap components we plan to add to our existing release of Spark/Spark ML in the AI Cloud platform:

  • Automated logical validation of AI/ML algorithms and other functions in Spark
  • Creation of more image data handling components
  • Support for XML and JSON format data
  • NLP-based summarized logging capability
  • Ability to predict job runtime based on the dataset and the parameters of algorithms and functions
  • Exploratory Data Analysis (EDA) of timeseries data
  • More data engineering and modeling capabilities regarding NLP

I hope you found this blog informative. To stay abreast of the latest updates, keep watching this space.