Analytics workloads in Decision Intelligence products and the need to optimize them
Decision Intelligence products help businesses make timely decisions that are accountable and actionable, with accurate inferences and recommendations. Decision Intelligence platforms need to offer key features that enable users to perform prescriptive, predictive, diagnostic, and descriptive analyses. Since these require compute-intensive processing, it is critical to minimize latency, maximize speed, and manage the scale of any additional data analytics workload on these DI platforms.
In this blog, we discuss the challenges of conventional ML workflows and how we can leverage a Cloud-based ML workflow to overcome these challenges.
As an enthusiast for data & insights, and with my experience in curating critical engineering features for Lumin, a leading Decision Intelligence product, I believe I can offer a unique perspective on how a Cloud-based ML workflow is a better option than conventional ML workflows.
Machine Learning (ML) workflow: Conventional and run of the mill
The conventional machine learning workflow (See Fig. 1) fetches the data out of the storage layer, pre-processes it, runs it through the training models, and finally saves the model in cloud storage such as S3.
This process has two major drawbacks:
- High costs: Keeping the compute infrastructure separate from the data increases maintenance costs in the long run.
- Security: Moving the data out of the secured storage layer leaves it potentially decrypted and vulnerable to attack.
Alternatively, the computations can run where the data lives, with Python, Java, and Scala code executing natively in Snowflake. This ensures that all workloads are handled without moving data outside the governed boundary, which eliminates exposure to security risks, reduces maintenance costs, and improves overall performance by avoiding data shuffling across environments.
Figure 1: Conventional ML workflow
Paving the path for a Cloud-based ML workflow with Snowpark
Snowpark is a set of libraries and runtimes that enable developers to securely deploy and process Python code in Snowflake.
Familiar client-side libraries: Snowpark brings deeply integrated, DataFrame-style programming and OSS-compatible APIs to the languages data practitioners like to use. It also includes the Snowpark ML library for faster and more intuitive end-to-end machine learning in Snowflake (See Figure 2). Snowpark ML has two APIs: Snowpark ML Modeling (public preview) for model development and Snowpark ML Operations (private preview) for model deployment.
Flexible runtime constructs: Snowpark provides flexible runtime constructs that allow users to bring in and run custom logic. Developers can seamlessly build data pipelines, ML models, and data applications with User-Defined Functions (UDFs) and Stored Procedures (SPROCs). Developers can also leverage the embedded Anaconda repository for effortless access to thousands of pre-installed open-source libraries.
These capabilities allow data engineers, data scientists, and data developers to build pipelines, ML models, and applications faster and more securely on a single platform, using their language of choice.
Figure 2: Machine Learning end-to-end workflow with Snowpark.
Snowpark code can be executed using two approaches (See Fig. 3):
- DataFrames: DataFrame operations are pushed down to Snowflake, where they are executed in Snowflake’s elastic engine, either as SQL queries or as more sophisticated UDFs, automatically on behalf of the user.
- Python functions: These can be registered as UDFs or stored procedures, during which Snowpark serializes and uploads the code to a stage. When a UDF or stored procedure is called, Snowpark executes the function in a secure Python sandbox in the server-side runtime, where the data is located. Snowpark is also integrated with the Anaconda repository, which provides access to thousands of curated, open-source Python packages.
Figure 3: Snowpark approaches to code.
Let us look at some sample code for each of these approaches:
The DataFrame API (Fig. 4) can be used for data preparation and feature engineering.
Figure 4: Snowpark Dataframe API example
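For readers who cannot view the figure, here is a minimal sketch of what such a DataFrame pipeline could look like. The connection parameters, the SALES table, and the column names are hypothetical placeholders for this example, not Lumin's actual schema.

```python
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F

# Hypothetical connection parameters; replace with your own account details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Lazy DataFrame over a hypothetical SALES table; no data is pulled client-side.
sales_df = session.table("SALES")

# Simple feature engineering, pushed down to Snowflake and executed as SQL.
features_df = (
    sales_df
    .filter(F.col("ORDER_DATE") >= "2023-01-01")
    .with_column("REVENUE", F.col("QUANTITY") * F.col("UNIT_PRICE"))
    .group_by("REGION")
    .agg(F.sum("REVENUE").alias("TOTAL_REVENUE"))
)

# Materialize the engineered features as a table for downstream training.
features_df.write.save_as_table("SALES_FEATURES", mode="overwrite")
```

Because the DataFrame is lazily evaluated, the filter, derived column, and aggregation are translated into a single SQL statement that runs entirely inside Snowflake.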
Stored procedures (as illustrated in Figure 5) can be used for model training.
Figure 5: Snowpark stored procedure example
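As a rough illustration of the pattern in Figure 5, the sketch below registers a training stored procedure. The table name, feature columns, model choice, and the @ML_MODELS stage are assumptions for the example, with scikit-learn standing in for whatever library the actual training uses.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

# Assumes an active Snowpark session (see the earlier sketch), an internal
# stage named @ML_MODELS, and the packages below available from Anaconda.
@sproc(name="train_forecast_model", is_permanent=True, replace=True,
       stage_location="@ML_MODELS",
       packages=["snowflake-snowpark-python", "scikit-learn", "joblib"])
def train_forecast_model(session: Session, table_name: str) -> str:
    import io
    import joblib
    from sklearn.linear_model import LinearRegression

    # Load the training data inside Snowflake's server-side Python sandbox.
    df = session.table(table_name).to_pandas()
    X, y = df[["FEATURE_1", "FEATURE_2"]], df["TARGET"]

    model = LinearRegression().fit(X, y)

    # Pickle the model and write it to the internal stage, so neither the
    # data nor the model leaves Snowflake.
    buffer = io.BytesIO()
    joblib.dump(model, buffer)
    buffer.seek(0)
    session.file.put_stream(buffer, "@ML_MODELS/forecast_model.joblib",
                            auto_compress=False, overwrite=True)
    return "model saved to @ML_MODELS/forecast_model.joblib"

# The procedure can then be invoked from the client, for example:
# session.call("train_forecast_model", "SALES_FEATURES")
```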
An example of a Snowpark UDF is illustrated below (Figure 6).
Figure 6: Snowpark UDF example
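Along the same lines, a scoring UDF might look like the sketch below. It assumes the pickled model from the previous example sits on the @ML_MODELS stage; the UDF name and feature signature are placeholders.

```python
from snowflake.snowpark.functions import udf

# Hypothetical permanent UDF; the staged model file is attached via `imports`
# and made available in the UDF's import directory at execution time.
@udf(name="predict_sales", is_permanent=True, replace=True,
     stage_location="@ML_MODELS",
     packages=["scikit-learn", "joblib"],
     imports=["@ML_MODELS/forecast_model.joblib"])
def predict_sales(feature_1: float, feature_2: float) -> float:
    import os
    import sys
    import joblib

    # Snowflake exposes the directory containing imported files at runtime.
    import_dir = sys._xoptions.get("snowflake_import_directory")
    model = joblib.load(os.path.join(import_dir, "forecast_model.joblib"))
    return float(model.predict([[feature_1, feature_2]])[0])
```

Once registered, the UDF can be called from SQL (for example, `SELECT predict_sales(FEATURE_1, FEATURE_2) FROM SALES_FEATURES`) or from a DataFrame expression; a production UDF would typically cache the loaded model rather than reload it on every call.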
Now that we have seen how conventional ML and Cloud-based ML workflows function, here is a side-by-side comparison of both approaches.
As you can see, a Cloud-based ML workflow offers businesses a clear advantage. Now let me show you a glimpse of how Lumin takes advantage of the power of the Cloud.
Lumin’s Declarative AI powered by Snowflake
Time-series forecasting is one of Lumin’s more advanced features and is, in fact, the prime showcase of its Declarative AI capabilities.
Here’s how it works: let’s assume a business user wants to see what sales would look like given the current business trend. The user can simply ask in plain English, “What will be my sales in the next 6 months?”. Lumin’s intelligence layer understands this natural language, converts it into Lumin query language, and hands it over to the analytics core engine.
The analytics core engine (See Fig. 7) understands the requirement and configuration from the self-serve layer and executes the following steps to arrive at a final output.
Figure 7: Analytics core engine workflow
Data preparation: In this step, the most common and essential preparation techniques, such as sanity checks, seasonality checks, stationarity checks, missing-value treatment, and outlier removal, are performed.
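A minimal sketch of what two of these preparation steps could look like with the Snowpark DataFrame API is shown below; the SALES_HISTORY table, the column names, and the three-sigma outlier rule are illustrative assumptions, not Lumin’s actual logic.

```python
import snowflake.snowpark.functions as F

# `session` is assumed to be an existing Snowpark Session (see the earlier sketch).
raw_df = session.table("SALES_HISTORY")

# Missing-value treatment: one simple policy is to fill gaps in the measure with 0.
clean_df = raw_df.fillna({"SALES": 0})

# Outlier removal: drop points more than three standard deviations from the mean.
stats = clean_df.agg(F.avg("SALES").alias("MU"),
                     F.stddev("SALES").alias("SIGMA")).collect()[0]
mu, sigma = stats["MU"], stats["SIGMA"]
prepared_df = clean_df.filter(
    (F.col("SALES") >= mu - 3 * sigma) & (F.col("SALES") <= mu + 3 * sigma)
)
```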
Model training: Lumin’s forecasting engine leverages an ensemble modeling technique to choose the best model for the given data.
Hyperparameter tuning is done at this stage using grid search, and the champion model is selected by minimizing the MAPE (Mean Absolute Percentage Error) value; it is then saved in pickled form to Snowflake’s internal stage. This entire training logic lives inside the Snowflake Data Warehouse in the form of stored procedures. The Snowpark client is initialized within the analytics core engine to register these stored procedures and invoke them with the required inputs. Neither the data nor the model is transferred outside Snowflake’s infrastructure.
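As a rough illustration of this selection step, the following sketch uses scikit-learn’s grid search with a MAPE-based scorer over time-series splits; the candidate model and parameter grid are assumptions for the example, not Lumin’s actual search space.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

def select_champion_model(X, y):
    """Pick the hyperparameter combination that minimizes MAPE."""
    param_grid = {"n_estimators": [100, 300], "max_depth": [2, 3, 5]}
    search = GridSearchCV(
        GradientBoostingRegressor(),
        param_grid,
        # scikit-learn maximizes the score, so negated MAPE is equivalent
        # to minimizing MAPE.
        scoring="neg_mean_absolute_percentage_error",
        cv=TimeSeriesSplit(n_splits=3),
    )
    search.fit(X, y)
    return search.best_estimator_

# Inside the training stored procedure, the champion model returned here would
# be pickled and written to the internal stage, as in the earlier sketch.
```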
Model explanation and narratives: Along with the forecast view, Lumin also provides narratives that explain how it arrived at specific insights (See Figure 8).
Figure 8: Model explanation
On the self-serve layer, creators can configure sales as the measure, and choose the machine learning technique(s) on which the forecasting needs to be performed, or simply set it to Auto mode.
Model inference: Lumin enables users to run simulations (See Figure 9) and understand the impact on the measure of altering the exogenous factors. The updated values are passed to the analytics core engine and used for inference. Here, UDFs are used to fetch the saved model from the Snowflake internal stage, execute the prediction in place, and return the result to the application’s visualisation layer.
Figure 9: Run simulation.
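To make the flow concrete, the sketch below shows one way a simulation run could invoke such a scoring UDF over the user’s adjusted inputs; the SIMULATION_INPUTS table and the predict_sales UDF are the hypothetical names introduced earlier.

```python
import snowflake.snowpark.functions as F

# `session` is assumed to be an existing Snowpark Session.
simulation_df = session.table("SIMULATION_INPUTS")

# Score the adjusted exogenous factors with the registered UDF; the prediction
# runs server-side, next to the staged model.
scored_df = simulation_df.with_column(
    "PREDICTED_SALES",
    F.call_udf("predict_sales", F.col("FEATURE_1"), F.col("FEATURE_2")),
)

# Only the scored results are returned to the visualisation layer.
results = scored_df.collect()
```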
Testing the firepower
As an engineering group, it is in our best interest to test product features thoroughly for performance. We benchmarked our Snowflake-Snowpark-driven features against three other analytical workloads (PySpark, FastAPI-based microservices, and SQL logic), run on custom datasets and scenarios.
During these tests, the Snowflake-Snowpark workflow yielded very good results. For the forecast workloads, which were converted from FastAPI-based microservices to Snowpark SPROCs, Lumin achieved a 20% improvement in execution speed and a 14% cost saving. Lumin’s key driver analysis also showed a 90% time benefit and a 12% cost benefit after transitioning from the previous Spark-based batch mode to the new Snowpark-based real-time run, which incorporates an algorithmic upgrade.
Subsequently, Nudges validation was converted from Spark to Snowflake SQL, which yielded a further ~80% improvement in speed of execution and a net cost savings of up to 65%.
It is also worth noting that conventional ML workflows rely on connected microservices and static clusters for execution, which increases overall runtime and cost. Taking all this into consideration, Snowflake & Snowpark deliver substantial benefits in both cost and speed of execution.