
Latest Trends in Snowflake
Snowflake is a platform that perhaps everyone in the organization has heard of by now. It started off as a data warehouse providing OLAP solutions, but over the 10 years of its existence it has grown into a full-fledged, feature-rich cloud data platform. This article is my attempt at covering a couple of the latest offerings and the domains Snowflake has been focusing on. Let’s dive right in!
Snowpark
One of the most interesting offerings Snowflake has for its customers is Snowpark. In the simplest terms, Snowpark offers a convenient way to interact with data on Snowflake using native programming language (Python/Java/Scala) constructs, while utilising Snowflake’s processing capabilities.
Key Benefits
- No need to predict and right-size the infrastructure. Snowpark utilises the readily available Snowflake Virtual Warehouses (VWHs) to execute the operations. This is particularly useful when running code that requires a cluster. Typically, a cluster takes 5 to 10 minutes to spin up, but with Snowpark this wait is much shorter because the work runs on existing Snowflake VWHs.
- Seamless access to third-party libraries means that even complex operations, such as running ML algorithms on data, can now be done directly on Snowflake.
- The ability to create custom user-defined functions (UDFs) in native programming languages and push them directly to Snowflake enables reusable code modules to be defined and called easily when needed by clients.
- Such a framework is also expected to bring a high degree of standardization to the data engineering architectures utilising Snowflake. (Who is calling Snowflake just a data warehouse now?)
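To make the UDF point in the list above a little more concrete, here is a minimal sketch, assuming the snowflake-snowpark-python package and an already created Snowpark session. The function name normalise_email and the UDF name are hypothetical, chosen only for illustration. The Snowpark import is done lazily so that the plain Python logic can be tried without a Snowflake connection.

```python
from typing import Optional


def normalise_email(email: Optional[str]) -> Optional[str]:
    """Plain Python logic that would run inside Snowflake once registered as a UDF."""
    if email is None:
        return None
    return email.strip().lower()


def register_udf(session):
    """Register normalise_email as a UDF; `session` is assumed to be a snowflake.snowpark.Session."""
    # Imported lazily so the function above can be tested without the package installed.
    from snowflake.snowpark.types import StringType

    return session.udf.register(
        normalise_email,
        name="normalise_email",  # hypothetical UDF name
        return_type=StringType(),
        input_types=[StringType()],
        replace=True,
    )


# Local check of the logic, no Snowflake connection needed.
print(normalise_email("  Jane.Doe@Example.COM "))
```

Once registered, such a function could be called on DataFrame columns from Python or by name in SQL, which is what makes it reusable.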
Examples of How Snowpark API for Python Works
Behind the scenes, DataFrame operations are transparently converted into SQL queries that are pushed down to, and benefit from, the high-performance, scalable Snowflake engine. UDFs are another key feature of Snowpark. They can be pushed to Snowflake directly from Python and then called either from Python or from within Snowflake. These UDFs are not limited to simple transformations; they can even run ML models on the data. In today’s rapidly expanding data ecosystem, Snowpark is here to expand the horizons of data engineering and machine learning.
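The pushdown idea can be illustrated without a Snowflake account. The sketch below is a toy model, not the Snowpark API: LazyFrame is a hypothetical class that only accumulates SQL text, and an in-memory SQLite database stands in for the Snowflake engine. It shows the general pattern of composing a query lazily and sending it to the engine only when results are requested.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy stand-in for a DataFrame: it holds SQL text, not rows."""
    conn: sqlite3.Connection
    sql: str

    def filter(self, condition: str) -> "LazyFrame":
        # Wrap the current query instead of fetching rows to the client.
        return LazyFrame(self.conn, f"SELECT * FROM ({self.sql}) WHERE {condition}")

    def select(self, *columns: str) -> "LazyFrame":
        return LazyFrame(self.conn, f"SELECT {', '.join(columns)} FROM ({self.sql})")

    def collect(self) -> list:
        # Only at this point is the composed SQL sent to the engine.
        return self.conn.execute(self.sql).fetchall()


# SQLite plays the role of the engine here; the table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EMEA", 120), ("APAC", 80), ("EMEA", 40)],
)

frame = LazyFrame(conn, "SELECT * FROM sales").filter("region = 'EMEA'").select("amount")
print(frame.sql)  # the single SQL statement that will be pushed down
rows = frame.collect()
print(rows)
```

The filtering and projection happen inside the engine; the client only ever receives the final rows.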
Data Governance in Snowflake
With the ever-increasing volume of data and number of users who access it, data governance is, quite aptly, a buzzword these days. Simply put, data governance is all about who can take what action, upon what data, in what situations and using what methods. The features available on Snowflake to enable data governance can be broadly grouped into the buckets covered below:
Knowing the Sensitive Data
1. Classification
Snowflake allows potentially sensitive data inside a table to be classified. Data engineers or administrators can run functions such as EXTRACT_SEMANTIC_CATEGORIES to scan all columns in a table/view. This function runs an algorithm on the data and outputs a JSON object containing, for each column, the semantic category the data was found to fall under (name/email/age) along with a probability score indicating the likelihood that the algorithm derived the correct value. The output can then be analysed, and columns masked as required. In the output, column names such as age and email_address appear as the keys.
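As a rough illustration of how the classification output could be analysed, here is a small sketch. The JSON below is a made-up sample in an assumed layout (the key names semantic_category and extra_info.probability are my assumption, and the real output may differ between Snowflake versions), and columns_to_review is a hypothetical helper.

```python
import json

# Made-up sample in an assumed layout; this is not real Snowflake output.
sample_output = """
{
  "age": {
    "semantic_category": "AGE",
    "extra_info": {"probability": "0.95"}
  },
  "email_address": {
    "semantic_category": "EMAIL",
    "extra_info": {"probability": "0.99"}
  }
}
"""


def columns_to_review(raw_json: str, min_probability: float = 0.9) -> dict:
    """Map column name -> semantic category for classifications at or above the threshold."""
    result = json.loads(raw_json)
    return {
        column: info["semantic_category"]
        for column, info in result.items()
        if float(info["extra_info"]["probability"]) >= min_probability
    }


print(columns_to_review(sample_output))
```

Columns that pass the threshold are natural candidates for tagging and masking.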
2. Object Tagging
Once the analysis of a table’s columns is done, the same output can be passed to the stored procedure ASSOCIATE_SEMANTIC_CATEGORY_TAGS to tag the respective columns with the semantic category found (age/email/name). Tags can also be created manually using the CREATE TAG command and then applied to new or existing schemas, tables, views, or columns.
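For manual tagging, the statements involved can be sketched as below. The helpers only build SQL text, which could then be run through any Snowflake client. The tag, table and column names are hypothetical, and identifiers are assumed to be trusted (no quoting or validation is attempted here).

```python
def create_tag_sql(tag_name: str) -> str:
    """Build a CREATE TAG statement for a new tag."""
    return f"CREATE TAG IF NOT EXISTS {tag_name}"


def set_column_tag_sql(table: str, column: str, tag_name: str, tag_value: str) -> str:
    """Build an ALTER TABLE statement that sets a tag on an existing column."""
    escaped = tag_value.replace("'", "''")  # escape single quotes in the string literal
    return f"ALTER TABLE {table} MODIFY COLUMN {column} SET TAG {tag_name} = '{escaped}'"


# Hypothetical names, for illustration only.
print(create_tag_sql("pii_type"))
print(set_column_tag_sql("customers", "email_address", "pii_type", "email"))
```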
3. Access History and Object Dependencies
Data on Snowflake can also be tracked based on its tags, and the access_history view can be used by administrators to track which data was accessed or modified, when, and by whom. This helps to track the lineage of sensitive data. The object_dependencies view can be used to track which objects were created using data from other objects.
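A query against the access history might look roughly like the one built below. This is a sketch under assumptions: it assumes the view is read from the SNOWFLAKE.ACCOUNT_USAGE schema and that the role running it has been granted access; the helper name and the table name in the usage line are hypothetical.

```python
def access_history_sql(table_name: str, days: int = 7) -> str:
    """Build a query listing who read a given table over the last `days` days."""
    safe_name = table_name.replace("'", "''")  # escape quotes in the string literal
    return f"""
SELECT ah.user_name,
       ah.query_start_time,
       obj.value:"objectName"::string AS object_name
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) AS obj
WHERE obj.value:"objectName"::string = '{safe_name}'
  AND ah.query_start_time >= DATEADD(day, -{int(days)}, CURRENT_TIMESTAMP())
ORDER BY ah.query_start_time DESC
""".strip()


# Hypothetical fully qualified table name.
print(access_history_sql("SALES_DB.PUBLIC.CUSTOMERS", days=30))
```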
Protecting the Data
1. Masking Policy
Dynamic data masking is a column-level security feature that masks plain-text data in table/view columns at query runtime, so that only users with the right privileges see the original values while analysts and other users see masked ones. Masking policies are created as schema-level objects and can then be applied to columns of tables as well as views.
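The behaviour of a masking policy can be mimicked in a few lines of plain Python. This is only a simulation of the idea, not Snowflake code: in Snowflake the same logic would be written as a masking policy whose expression checks the current role. The role name PII_ADMIN and the sample rows are hypothetical.

```python
ALLOWED_ROLES = {"PII_ADMIN"}  # hypothetical role that may see plain text


def mask_email(value: str, current_role: str) -> str:
    """Return the real value for privileged roles and a fixed mask otherwise."""
    if current_role in ALLOWED_ROLES:
        return value
    return "*********"


def run_query(rows: list, current_role: str) -> list:
    """Apply the mask while reading; the stored rows are never modified."""
    return [{**row, "email": mask_email(row["email"], current_role)} for row in rows]


stored_rows = [{"name": "Jane", "email": "jane@example.com"}]  # made-up data
print(run_query(stored_rows, "PII_ADMIN"))
print(run_query(stored_rows, "ANALYST"))
```

The key point is that masking happens when the data is read, so the same table serves both audiences without keeping a second, redacted copy.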
A newer Snowflake feature, not yet released to the public at the time of writing, will allow a masking policy to be attached to a tag, so that the policy is applied automatically whenever that tag is assigned to an object.
2. Tokenization
Snowflake allows tokenization to be applied to objects such as columns. The fundamental idea is still the application of a masking policy, but the interesting part is that hashing algorithms like SHA-2 can be used to tokenize column values when the role querying them does not have access. Snowflake also allows external tokenization, which is accomplished by using masking policies with external functions (created using AWS/GCP/Azure). This enables data to be tokenized before it is loaded into Snowflake, so that only the appropriate audiences can detokenize it at query runtime.
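Hash-based tokenization can be sketched in plain Python with SHA-256, a member of the SHA-2 family. Again, this is a simulation of the idea rather than Snowflake code, and the role name is hypothetical.

```python
import hashlib

ALLOWED_ROLES = frozenset({"PII_ADMIN"})  # hypothetical privileged role


def tokenize(value: str, current_role: str) -> str:
    """Return the plain value for privileged roles, else its SHA-256 hex digest."""
    if current_role in ALLOWED_ROLES:
        return value
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


token_a = tokenize("jane@example.com", "ANALYST")
token_b = tokenize("jane@example.com", "ANALYST")
print(token_a == token_b)  # the same input always yields the same token
print(tokenize("jane@example.com", "PII_ADMIN"))
```

Because equal inputs produce equal tokens, analysts can still join and count on the tokenized column without seeing the underlying values. A plain hash cannot be reversed, so detokenization requires a separate mechanism such as the external tokenization described above.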
3. Row-level Security
Snowflake allows row-level policies to be applied to limit the rows visible to a role at query time. A good use case is limiting the sales figures a sales manager can see to those of the manager’s own region. This can be done by creating a mapping table that records which region each manager falls in, then creating a policy that checks this table and applying it to the sales table. At query runtime, Snowflake evaluates the policy expression and returns only the rows that satisfy it.
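The sales manager example can be simulated as follows. This is a plain Python sketch of the logic, not Snowflake syntax: the dictionary plays the role of the mapping table, row_is_visible plays the role of the policy expression, and all role names and figures are made up.

```python
# Stand-in for the mapping table: which region each manager role may see.
REGION_BY_ROLE = {"SALES_MGR_EMEA": "EMEA", "SALES_MGR_APAC": "APAC"}


def row_is_visible(row_region: str, current_role: str) -> bool:
    """Policy expression: True only if the role is mapped to this row's region."""
    return current_role in REGION_BY_ROLE and REGION_BY_ROLE[current_role] == row_region


def query_sales(rows: list, current_role: str) -> list:
    """Evaluate the policy per row at query time and return only permitted rows."""
    return [row for row in rows if row_is_visible(row["region"], current_role)]


sales = [
    {"region": "EMEA", "amount": 120},
    {"region": "APAC", "amount": 80},
    {"region": "EMEA", "amount": 40},
]
print(query_sales(sales, "SALES_MGR_EMEA"))
print(query_sales(sales, "INTERN"))  # unmapped role sees no rows
```

Keeping the role-to-region mapping in a table means access changes are a data update rather than a policy rewrite.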
The points covered above are by no means the only innovations and developments coming from Snowflake, but they are all that could fit here. Snowflake’s Unistore offering is another thing we should look forward to. If you are curious to learn more, feel free to explore the links given below, and do get in touch with me if you have any ideas to share! Happy learning!
References:
- https://docs.snowflake.com/en/developer-guide/snowpark/index.html
- https://www.phdata.io/blog/what-is-snowpark/
- https://www.snowflake.com/blog/snowpark-python-innovation-available-all-snowflake-customers/
- https://github.com/Snowflake-Labs/snowpark-python-demos
- https://docs.snowflake.com/en/user-guide/security-column.html
- https://docs.snowflake.com/en/user-guide/tag-based-masking-policies.html
- https://docs.snowflake.com/en/user-guide/governance-classify-using.html
