The modern tech world has become a data hub reliant on processing. Today, there is user data on everything from driving records to scroll speed on social media applications. As a result, there is considerable demand for methods to process this data, since it holds hidden insights that can propel a company onto the global stage faster than ever before.
These insights are commonly found through machine learning implementations built on open-source libraries such as TensorFlow, scikit-learn, and many more. While it may be easy to prototype such models, engineering one that provides usable benefits to a company's daily operations is still incredibly difficult. This shortage of reliable products is the primary barrier to entry into this market.
Thankfully, Fosfor’s Refract lowers many of these barriers through a guided, user-friendly approach that allows even those with minimal ML experience to build a robust, reproducible, and shareable model. From start to finish, Refract provides a consistent experience that transfers easily between users, without any extra effort on their part. With Refract, sharing an ML project is just as easy as sharing a Google Doc.
End-to-end example
Data manipulation
To demonstrate the capabilities of Refract, I’ve used an interesting test case pulled from Kaggle called Spaceship Titanic. It’s a simple dataset with 14 variables about individuals’ travel information. Due to a “spacetime anomaly,” many of them have been transported to an alternative dimension.
My goal was to predict who gets transported based on the other factors. There are 8,693 training examples, with fewer than 5% null values for any given factor. The data includes quantitative inputs, Booleans, and strings. The plan was to clean the null values for each data type and encode the Boolean and string values using label encoding. I performed the same cleaning operations twice: once through Refract Lens and once in a Python notebook in Jupyter. Since Refract allows users to create a virtual notebook in the cloud, both methods were completed seamlessly within Refract.
Refract Lens
In Refract, after uploading the data to the cloud, there is a convenient option to create a “lens” for a given dataset. With this lens, users can use an intuitive GUI to perform the same operations that can be done through code. Refract’s visual interface is much easier to read, which is helpful when sharing the project. These features allowed me to impute null values easily, choosing among the options offered for a given data column. To apply label encoding consistently, I added a new step to the lens. As Refract autodetects which columns are encodable, I selected encoding and chose all columns. From there, I saved the cleaned data into a new CSV for later use.
Figure 1: Quick auto-suggestions on Refract ease data preparation steps
Jupyter notebook in Refract
I used the Pandas library to carry out by hand the same steps that Refract Lens performs. This library is widely used in data manipulation projects in Python. After pulling in the CSV data, the same process of imputing null values, with the mode for Boolean values and the mean for numeric values, was much more code intensive. It is worth noting that the Pandas library offers more flexibility in how the data is cleaned. That said, it still requires a solid understanding of how the data structures implemented by Pandas work. While this may not be an issue for an experienced data scientist, the ease of a GUI is unparalleled.
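To give a sense of what "code intensive" means here, the following is a minimal sketch of the cleaning steps described above, not the notebook used for this post. The helper name `clean_and_encode` and the toy rows are hypothetical, and it assumes string columns are also imputed with the mode and that label encoding can be approximated with `pd.factorize`.

```python
import pandas as pd


def clean_and_encode(df: pd.DataFrame) -> pd.DataFrame:
    """Impute nulls (mean for numeric, mode otherwise) and label-encode
    every non-numeric column. Hypothetical helper for illustration."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_bool_dtype(out[col]):
            # Pure Boolean column without nulls: False -> 0, True -> 1.
            out[col] = out[col].astype(int)
        elif pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill nulls with the column mean.
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Strings, or Booleans that contain nulls (object dtype):
            # fill nulls with the most frequent value, then label-encode.
            most_frequent = out[col].mode(dropna=True).iloc[0]
            filled = out[col].fillna(most_frequent)
            codes, _ = pd.factorize(filled, sort=True)
            out[col] = codes
    return out


# Toy rows standing in for the real CSV (values are made up).
toy = pd.DataFrame(
    {
        "HomePlanet": ["Earth", "Europa", None, "Earth"],
        "CryoSleep": [True, False, None, False],
        "Age": [20.0, None, 40.0, 30.0],
    }
)
cleaned = clean_and_encode(toy)
print(cleaned)
```

Even in this small form, the code has to distinguish Boolean columns that contain nulls from those that do not, which is the kind of Pandas detail the GUI hides.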
Creating a model
Refract AutoML
Refract includes a feature, coupled to the GUI, that allows for the automatic creation of an ML model. The feature uses one of three AutoML libraries:
- MLJAR
- TPOT
- AutoGluonClassification
These are popular open-source libraries for AutoML. Users can choose to run one or more of them to see which one gives the best accuracy. I ran two of them, TPOT and AutoGluon, on the dataset created through the lens.
From there, I picked the one with the best accuracy and precision. AutoGluon achieved an accuracy of 0.82 and an F1 score of 0.82, while TPOT achieved an accuracy of 0.77, so I chose to register the AutoGluon model. Both models were run on the cleaned dataset from the previous section. The model was created using scikit-learn’s libraries, but all invocations were hidden in the AutoML backend.
Figure 2: AutoML results page
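For readers unfamiliar with the two scores used in this comparison, here is a small sketch of how accuracy and F1 are defined for a binary target and how a winner might be picked by accuracy. It uses the standard textbook definitions; the labels, the predictions, and the names `model_a` and `model_b` are made up for illustration and are not output from Refract or from either AutoML library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions from two hypothetical candidate models.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 1, 0, 0, 0, 1, 1],
    "model_b": [1, 1, 0, 0, 0, 1, 1, 0],
}

scores = {
    name: {"accuracy": accuracy(y_true, y_pred), "f1": f1_score(y_true, y_pred)}
    for name, y_pred in predictions.items()
}
best = max(scores, key=lambda name: scores[name]["accuracy"])
print(scores, best)
```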
Jupyter notebook on Refract
For a comparison to AutoML, I used TensorFlow to create a 6-layer neural network with ReLU activations for the hidden layers and a sigmoid for the output. I then compiled it using the Adam optimizer alongside the binary cross-entropy loss function. When fit to the cleaned data, the resulting accuracy was 80.2%. Furthermore, the accuracy on a test set, a subset held out from the initial training set, was 80.4%, suggesting that the model was not overfit. The manual and AutoML models achieved very similar accuracy, which illustrates the efficiency gains that automatic fitting can provide for simpler models.
Figure 3: Manual model training in Jupyter notebook
Figure 4: Accuracy after manual training
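A network of the kind described in this section could be sketched in Keras roughly as follows. This is an illustration under stated assumptions, not the notebook shown in the figures: it reads "6-layer" as five hidden ReLU layers plus one sigmoid output, and the layer widths, the feature count of 13, and the random input are all hypothetical.

```python
import numpy as np
import tensorflow as tf


def build_model(n_features: int) -> tf.keras.Model:
    """Six Dense layers: five ReLU hidden layers and one sigmoid output.
    Layer widths are arbitrary choices for illustration."""
    model = tf.keras.Sequential(
        [
            tf.keras.Input(shape=(n_features,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ]
    )
    # Adam optimizer with binary cross-entropy loss, tracking accuracy.
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Smoke test on random input; 13 features is an assumed count.
model = build_model(n_features=13)
X = np.random.default_rng(0).normal(size=(8, 13)).astype("float32")
preds = model.predict(X, verbose=0)
print(preds.shape)
```

With real data, `model.fit(X_train, y_train, validation_split=0.2)` would hold out part of the training set, which is one way to obtain the kind of held-out accuracy check mentioned above.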
Reproducibility on different models
Taking a step deeper into the capabilities of Refract, I created two more models: one for regression and one for multi-class classification. With similar steps for data prep and modeling, AutoML continued to provide comparable results. In fact, they were within a 3% margin of the models built in Jupyter in the cloud within Refract. Since Refract offers a plethora of templates for creating a notebook within the site, as well as the ability to create one from scratch, containers can be tailored specifically to the resource load of a given project.
Figure 5: Different notebook templates to choose from
Varying virtual machine sizes allow for efficient resource utilization regardless of the type of power needed. These amplify the power already present in Jupyter Notebook. In this case, my datasets ranged from 100 MB to 2 GB and still ran at a relatively similar speed across projects on both AutoML and Jupyter Notebook kernels. This is simply because the environment is as easy to manage as any powerful VM.
Figure 6: Choose how much horsepower you need for the training
Furthermore, once a model has been created, the journey towards production usage becomes intuitive. Refract makes MLOps management more straightforward because of its:
- Easy-to-access build-time metrics
- One-click deployment offered as a scalable web service
- Model usage dashboards
- Model explainability and data-drift alerts
Ultimately, the entire life-cycle of a project can be simplified and hosted in a singular place, making AI/ML integration easier for everyone.
Figure 7: Register, deploy, maintain & monitor ML models
Is Refract right for you?
Data science is an incredibly versatile and powerful tool for companies to grow in the modern world. While many solutions exist for individual parts of the process, such as data prep or model deployment, there are not many options like Refract. Refract brings the entire cycle into a simple visual interface, from data discovery and data preparation to ML training, model deployment, and MLOps.
Request a demo here and give Refract a try.