Unsere Technologiepartner

Lorem ipsum dolor sit amet

Kategorien

Processing binary files (pdf, eml, xlsx, jpg) on Snowflake using Python

What happens when key business information is locked away in PDFs, emails, and images? How do you make the insights from these files available to your business? Where do you start if you are on Snowflake?

On Snowflake, the easiest way is to first review the Anaconda Snowflake Channel if there are any Python libraries that can process your type of data. All libraries in this channel are natively integrated in Snowflake, meaning security and package dependency management is already taken care of.

These are the first steps to get started:

  1. Ensure your files are made available in a Snowflake stage (docs)
  2. Find a library on the Anaconda Snowflake Channel that can handle your type of data.

While performing step 2, you will usually find “simple usage” examples after clicking “dev” or “doc” for more information on the library.

Here is an example code for PDFs, using the library PyPDF2, which is also a common choice:

from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())

 

Troubleshooting Tip:

Pay attention to the library version number supported on the Anaconda. The documentation link provides “simple usage” code for the most recent version of the library, which might not work with your Snowflake setup. In these cases, simply navigate the docs until you find an example code for your library version.

Most Python libraries are designed to be executed on a file system. Since direct file system access is not possible when reading from a Snowflake stage, we need to use Snowflake’s extension of the binary library to first read the file into memory. For simplicity, this Snowflake specific step can be separated out to the new function, which we will call read_file_using_binary().

We will also move the main part of the code to reside within a function (read_pdf()), as all Python code need to reside within functions.

from io import BytesIO
from snowflake.snowpark.files import SnowflakeFile
from PyPDF2 import PdfReader
def read_file_using_binary(binary_file_path):
    with SnowflakeFile.open(binary_file_path, 'rb') as binary_file:
        file = BytesIO(binary_file.read())
        return file
def read_pdf(pdf_file_path):
    file = read_file_using_binary(pdf_file_path)
    reader = PdfReader(file)
    page = reader.pages[0]
    return page.extract_text()

Note that the function we added, read_file_using_binary(), uses the SnowflakeFile library from Snowpark. This is required for all binary files on Snowflake. Further information on the docs.

Once you have adjusted your “simple usage” snippet like above, you can register this user defined function for use in SQL queries, using the following SQL statement:

create or replace function read_pdf (pdf_file_name string)
returns STRING
language python
runtime_version = '3.11'
handler = 'read_pdf'
packages = ('PyPDF2', 'snowflake')
as
$$
[Insert the Python code snippet from above]
$$;

Note that non-base Python packages need to be specified to be “installed in memory” under the “packages” argument, like PyPDF2 and snowflake in this example. There is minimal lag during execution, even when referencing a large library. You can specify the Python version for each Python function you register separately.

Now that the function is registered as a User Defined Function (UDF), the next step is to prepare the query. For dynamically sending files to a Python function, you need to provide access to the data on your external stage.

In this example, we create a temporary access URL, valid for a maximum of 24 hours, for security reasons:

select build_scoped_file_url('@function_test_data', 'read_pdf_input.pdf') as pdf_url;

Read more on access URLs here.

Before you proceed, try to run this using the Snowflake UI. The output should be a hyperlink you can click to download the file.

It will typically look like this:

https://consulting.eu-central-1.snowflakecomputing.com/api/files/01b42503-0102...

 

If this worked, add the function call:

select read_pdf(build_scoped_file_url('@function_test_data', 'read_pdf_input.pdf')) as pdf_text;

 

This time the output should be contents from your pdf file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit...

 

Since this is a simple usage example, it is indeed only for quickly testing the capabilities of a particular library. It should enable you to answer questions such as:

  1. Is there a library I can easily use on Snowflake, that can read my type of file

  2. Will the library be able to capture the necessary granularity and accuracy (example for PDFs, are white spaces, special characters and positions properly reflected?)

  3. By looking at the outputs, does it seem feasible to transform this output into a table with columns and rows?

  4. Is the processing time, and hence credit consumption, seem acceptable?

After you try out a few different libraries and find the right one, you’re all set to start the real work. Once done, pull out important business info from those (previous ) tricky files and use it to make better decisions for your business.