Databricks user guide for Privacera Platform

Spark Fine-grained Access Control (FGAC)

Enable View-level access control

Edit the SparkConfig of your existing Privacera-enabled Databricks Cluster. See Configure Databricks Spark Fine-Grained Access Control Plugin [FGAC] [Python, SQL].

Add the following property:

spark.hadoop.privacera.spark.view.levelmaskingrowfilter.extension.enable true

Save and restart the Databricks cluster.
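
Before restarting, it can help to confirm that the property was actually saved with the value `true`. The sketch below is only an illustration: the helper name `view_fgac_enabled` is hypothetical, and it assumes the Spark config is available as plain text with one `key value` pair per line. It is not part of Privacera's tooling.

```python
# Property name taken from the step above.
VIEW_FGAC_PROPERTY = (
    "spark.hadoop.privacera.spark.view.levelmaskingrowfilter.extension.enable"
)


def view_fgac_enabled(spark_config_text):
    """Return True if the config text sets the view-level property to true.

    Assumes one "key value" pair per line (hypothetical input format).
    """
    for line in spark_config_text.splitlines():
        parts = line.split(None, 1)  # split the key from the value
        if len(parts) == 2 and parts[0] == VIEW_FGAC_PROPERTY:
            return parts[1].strip().lower() == "true"
    return False  # property not present at all


# "spark.example.other.setting" is a made-up key used only as filler.
sample = (
    "spark.example.other.setting 42\n"
    "spark.hadoop.privacera.spark.view.levelmaskingrowfilter.extension.enable true"
)
print(view_fgac_enabled(sample))
```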

Apply View-level access control

To run CREATE VIEW with the Spark plug-in, you need the DATA_ADMIN permission.

The source table on which you create the view must grant DATA_ADMIN access in its Ranger policy.

Use Case

Consider a use case with a database named 'employee_db' that contains two tables with the following data:

-- Requires create privilege on the database, which is enabled by default
create database if not exists employee_db;

Create two tables.

-- Requires privilege for table creation
create table if not exists employee_db.employee_data(id int,userid string,country string);
create table if not exists employee_db.country_region(country string,region string);

Insert test data.

-- Requires update privilege on the tables

insert into employee_db.country_region values ('US','NA'), ('CA','NA'), ('UK','UK'), ('DE','EU'), ('FR','EU'); 
insert into employee_db.employee_data values (1,'james','US'),(2,'john','US'), (3,'mark','UK'), (4,'sally-sales','UK'),(5,'sally','DE'), (6,'emily','DE');

-- Requires select privilege on the columns
select * from employee_db.country_region; 
select * from employee_db.employee_data;

Now try to create a view on top of the two tables. The statement fails with the error shown below:

create view employee_db.employee_region(userid, region) as select e.userid, cr.region from employee_db.employee_data e, employee_db.country_region cr where e.country = cr.country;

Error: Error while compiling statement: 
FAILED: HiveAccessControlException 
Permission denied: user [emily] does not have [DATA_ADMIN] privilege on [employee_db/employee_data] (state=42000,code=40000)
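
When troubleshooting, it can be handy to pull the user, privilege, and resource out of a message like the one above so you know which policy to adjust. The following is a minimal sketch that assumes the message keeps the bracketed shape shown here; the function name `parse_permission_denied` is hypothetical.

```python
import re

# Matches the "Permission denied" line in the shape shown above.
DENIED_RE = re.compile(
    r"user \[(?P<user>[^\]]+)\] does not have \[(?P<privilege>[^\]]+)\] "
    r"privilege on \[(?P<resource>[^\]]+)\]"
)


def parse_permission_denied(message):
    """Return a dict with user, privilege, and resource, or None if no match."""
    match = DENIED_RE.search(message)
    return match.groupdict() if match else None


error_line = (
    "Permission denied: user [emily] does not have [DATA_ADMIN] privilege on "
    "[employee_db/employee_data] (state=42000,code=40000)"
)
print(parse_permission_denied(error_line))
```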

Create a view policy for employee_db.employee_region. Once the policy is in place, execute the same query again, and it passes through.

Note
Granting DATA_ADMIN privileges on a resource implicitly grants the Select privilege on the same resource.

Alter View

-- Requires alter permission on the view
ALTER VIEW employee_db.employee_region AS  select e.userid, cr.region 
from employee_db.employee_data e, employee_db.country_region cr where 
e.country = cr.country;

Rename View

-- Requires alter permission on the view
ALTER VIEW  employee_db.employee_region RENAME to employee_db.employee_region_renamed;

Drop View

-- Requires drop permission on the view
DROP VIEW employee_db.employee_region_renamed;

Row-Level Filter

create view if not exists employee_db.employee_region(userid, region) as select 
e.userid, cr.region from employee_db.employee_data e, 
employee_db.country_region cr where e.country = cr.country;

select * from employee_db.employee_region;
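
To see which rows the view contains before any row-level filter is applied, the same join can be reproduced locally. The sketch below uses Python's built-in sqlite3 module instead of Spark, so it involves no Privacera access control at all; the `region = 'EU'` condition at the end is a made-up filter, included only to show the kind of effect a row-level filter has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite has no employee_db database here, so plain table names are used.
cur.execute("create table employee_data(id int, userid text, country text)")
cur.execute("create table country_region(country text, region text)")
cur.executemany(
    "insert into country_region values (?, ?)",
    [("US", "NA"), ("CA", "NA"), ("UK", "UK"), ("DE", "EU"), ("FR", "EU")],
)
cur.executemany(
    "insert into employee_data values (?, ?, ?)",
    [(1, "james", "US"), (2, "john", "US"), (3, "mark", "UK"),
     (4, "sally-sales", "UK"), (5, "sally", "DE"), (6, "emily", "DE")],
)

# Same join as the employee_region view in the use case.
cur.execute(
    "create view employee_region(userid, region) as "
    "select e.userid, cr.region from employee_data e, country_region cr "
    "where e.country = cr.country"
)

# Unfiltered contents of the view.
rows = cur.execute(
    "select userid, region from employee_region order by userid"
).fetchall()
for row in rows:
    print(row)

# Hypothetical filter condition, standing in for a row-level filter policy.
eu_rows = cur.execute(
    "select userid from employee_region where region = 'EU' order by userid"
).fetchall()
print(eu_rows)
```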

Column Masking

select * from employee_db.employee_region;

Access AWS S3 using Boto3 from Databricks

This section describes how to use the AWS SDK (Boto3) for Privacera Platform to access AWS S3 file data through a Privacera DataServer proxy.

Prerequisites

Ensure that the following prerequisites are met:

Add the iptables rule to the Databricks init script.
To enable Boto3 access control in your Databricks environment, add the following command, which opens port 8282 for outgoing connections:
```
sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
```
Restart the Databricks cluster.

Pass the iptables command through the Privacera Manager properties in the vars.databricks.plugin.yml file, as shown below, and then run the Privacera Manager update command.

DATABRICKS_POST_PLUGIN_COMMAND_LIST: 
- echo "Completed Installation" 
- sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
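
If the DataServer port differs between environments, the property fragment can be generated rather than edited by hand. This is a rough sketch under the assumption that the list only needs the two entries shown above; the helper names `iptables_accept_command` and `post_plugin_command_list` are hypothetical.

```python
def iptables_accept_command(port):
    """Build the outbound ACCEPT rule from this section for a given TCP port."""
    port = int(port)
    if not 1 <= port <= 65535:
        raise ValueError("port out of range: {}".format(port))
    return (
        "sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport {} -j ACCEPT".format(port)
    )


def post_plugin_command_list(port=8282):
    """Render the DATABRICKS_POST_PLUGIN_COMMAND_LIST fragment as YAML text."""
    commands = ['echo "Completed Installation"', iptables_accept_command(port)]
    lines = ["DATABRICKS_POST_PLUGIN_COMMAND_LIST:"]
    lines += ["- {}".format(command) for command in commands]
    return "\n".join(lines)


print(post_plugin_command_list())
```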

Accessing AWS S3 files

Run the following commands in a Databricks notebook:

Install the AWS Boto3 libraries
```
pip install boto3
```
Import the required libraries
```
import boto3
```

Fetch the DataServer certificate

If SSL is enabled on the dataserver, the port is 8282.

%sh
sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
dirname="/tmp/lib3"
mkdir -p -- "$dirname"
DS_URL="https://{DATASERVER_EC2_OR_K8S_LB_URL}:{DAS_SSL_PORT}"
#Sample url as shown below
#DS_URL="https://10.999.99.999:8282"
DS_CERT_FILE="$dirname/ds.pem"

curl -k -H "connection:close" -o "${DS_CERT_FILE}" "${DS_URL}/services/certificate"

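Because curl writes whatever the server returns to the output file, it is worth checking that the downloaded file really contains a certificate before passing it to the SDK. The sketch below assumes the certificate endpoint returns PEM text (suggested by the ds.pem file name, but not confirmed here); the helper names are hypothetical.

```python
import re

# A PEM certificate is base64 text between BEGIN/END CERTIFICATE markers.
PEM_BLOCK_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+[A-Za-z0-9+/=\s]+-----END CERTIFICATE-----"
)


def looks_like_pem_certificate(text):
    """Return True if the text contains at least one PEM certificate block."""
    return PEM_BLOCK_RE.search(text) is not None


def check_cert_file(path="/tmp/lib3/ds.pem"):
    """Read the downloaded file and check that it looks like a PEM certificate."""
    with open(path, "r") as handle:
        return looks_like_pem_certificate(handle.read())
```

In a notebook, calling `check_cert_file()` after the download step returns False if, for example, the proxy answered with an error page instead of a certificate.
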
Access the AWS S3 files

def check_s3_file_exists(bucket, key, access_key, secret_key, endpoint_url, dataserver_cert, region_name):
  # Returns True if the object could be read through the DataServer proxy.
  exec_status = False
  try:
    # endpoint_url points at the DataServer; verify= is the certificate fetched earlier.
    s3 = boto3.resource(service_name='s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url, region_name=region_name, verify=dataserver_cert)
    print(s3.Object(bucket_name=bucket, key=key).get()['Body'].read().decode('utf-8'))
    exec_status = True
  except Exception as e:
    print("Got error: {}".format(e))
  finally:
    return exec_status
  
def read_s3_file(bucket, key, access_key, secret_key, endpoint_url, dataserver_cert, region_name):
  # Returns True if the object could be read through the DataServer proxy.
  exec_status = False
  try:
    s3 = boto3.client(service_name='s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url, region_name=region_name, verify=dataserver_cert)
    obj = s3.get_object(Bucket=bucket, Key=key)
    print(obj['Body'].read().decode('utf-8'))
    exec_status = True
  except Exception as e:
    print("Got error: {}".format(e))
  finally:
    return exec_status
  
readFilePath = "file data/data/format=txt/sample/sample_small.txt"
bucket = "infraqa-test"
#platform
access_key = "${privacera_access_key}"
secret_key = "${privacera_secret_key}"
endpoint_url = "https://${DATASERVER_EC2_OR_K8S_LB_URL}:${DAS_SSL_PORT}"
#sample value as shown below
#endpoint_url = "https://10.999.99.999:8282"
priv_dataserver_cert = "/tmp/lib3/ds.pem"
region_name = "us-east-1"
print(f"got file===== {readFilePath} ============= bucket= {bucket}")
status = check_s3_file_exists(bucket, readFilePath, access_key, secret_key, endpoint_url, priv_dataserver_cert, region_name)
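
The function above decides that a file exists by downloading and printing its contents. For a pure existence check, a HEAD request is usually lighter. The sketch below uses the boto3 S3 client method `head_object`; the function name `s3_object_exists` is hypothetical, and whether the DataServer proxy forwards HEAD requests the same way as GET requests is an assumption you should verify in your environment.

```python
def s3_object_exists(s3_client, bucket, key):
    """Return True if the object exists, False if the service reports not found.

    s3_client is expected to be a boto3 S3 client created with the same
    endpoint_url and verify arguments as in the functions above.
    """
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except Exception as err:
        # botocore's ClientError carries the service error code in err.response.
        response = getattr(err, "response", None) or {}
        code = str(response.get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchKey", "NotFound"):
            return False
        raise  # anything else (for example access denied) is not "missing"


# Usage in a notebook (not executed here):
# s3 = boto3.client(service_name='s3', aws_access_key_id=access_key,
#                   aws_secret_access_key=secret_key, endpoint_url=endpoint_url,
#                   region_name=region_name, verify=priv_dataserver_cert)
# print(s3_object_exists(s3, bucket, readFilePath))
```

Re-raising other errors keeps a permission problem from being mistaken for a missing file.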

Access Azure file using Azure SDK from Databricks

This section describes how to use the Azure SDK for Privacera Platform to access Azure DataStorage/Datalake file data through a Privacera DataServer proxy.

Prerequisites

Ensure that the following prerequisites are met:

Add the iptables rule to the Databricks init script.
To enable Azure SDK access control in your Databricks environment, add the following command, which opens port 8282 for outgoing connections:
```
sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
```
Restart the Databricks cluster.

Pass the iptables command through the Privacera Manager properties in the vars.databricks.plugin.yml file, as shown below, and then run the Privacera Manager update command.

DATABRICKS_POST_PLUGIN_COMMAND_LIST: 
- echo "Completed Installation" 
- sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT

Accessing Azure files

Run the following commands in a Databricks notebook:

Install the Azure SDK libraries
```
pip install azure-storage-file-datalake
```

Import the required libraries

import os, uuid, sys
from azure.storage.filedatalake import DataLakeServiceClient
from azure.core import MatchConditions
from azure.storage.filedatalake import ContentSettings

Fetch the DataServer certificate

If SSL is enabled on the dataserver, the port is 8282.

%sh
sudo iptables -I OUTPUT 1 -p tcp -m tcp --dport 8282 -j ACCEPT
dirname="/tmp/lib3"
mkdir -p -- "$dirname"
DS_URL="https://{DATASERVER_EC2_OR_K8S_LB_URL}:{DAS_SSL_PORT}"
#Sample url as shown below
#DS_URL="https://10.999.99.999:8282"
DS_CERT_FILE="$dirname/ds.pem"

curl -k -H "connection:close" -o "${DS_CERT_FILE}" "${DS_URL}/services/certificate"

Initialize the account storage through connection string method

def initialize_storage_account_connect_str(my_connection_string):
    
    try:  
        global service_client
        print(my_connection_string)
        os.environ['REQUESTS_CA_BUNDLE'] = '/tmp/lib3/ds.pem'
        service_client = DataLakeServiceClient.from_connection_string(conn_str=my_connection_string, headers={'x-ms-version': '2020-02-10'})
    
    except Exception as e:
        print(e)
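
The function above points REQUESTS_CA_BUNDLE at /tmp/lib3/ds.pem without checking that the file is there, so a failed download only shows up later as an SSL error. A small guard, sketched below, can surface the problem earlier. The helper name `use_ca_bundle` is hypothetical and the default path is simply the one used in this guide.

```python
import os


def use_ca_bundle(cert_path="/tmp/lib3/ds.pem"):
    """Set REQUESTS_CA_BUNDLE to cert_path after checking the file is usable."""
    if not os.path.isfile(cert_path) or os.path.getsize(cert_path) == 0:
        raise FileNotFoundError(
            "DataServer certificate not found or empty: {}".format(cert_path)
        )
    os.environ["REQUESTS_CA_BUNDLE"] = cert_path
    return cert_path
```

Calling `use_ca_bundle()` before creating the DataLakeServiceClient would replace the direct `os.environ` assignment.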

Prepare the connection string

def prepare_connect_str():
    try:
        
        connect_str = "DefaultEndpointsProtocol=https;AccountName=${privacera_access_key}-{storage_account_name};AccountKey=${base64_encoded_value_of(privacera_access_key|privacera_secret_key)};BlobEndpoint=https://${DATASERVER_EC2_OR_K8S_LB_URL}:${DAS_SSL_PORT};"
        
       # sample value is shown below
       #connect_str = "DefaultEndpointsProtocol=https;AccountName=MMTTU5Njg4Njk0MDAwA6amFpLnBhdGVsOjE6MTY1MTU5Njg4Njk0MDAw==-pqadatastorage;AccountKey=TVRVNUTU5Njg4Njk0MDAwTURBd01UQTZhbUZwTG5CaGRHVnNPakU2TVRZMU1URTJOVGcyTnpVMTU5Njg4Njk0MDAwVZwLzNFbXBCVEZOQWpkRUNxNmpYcjTU5Njg4Njk0MDAwR3Q4N29UNFFmZWpMOTlBN1M4RkIrSjdzSE5IMFZic0phUUcyVHTU5Njg4Njk0MDAwUxnPT0=;BlobEndpoint=https://10.999.99.999:8282;"

        return connect_str
    except Exception as e:
        print(e)
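
The connection string template can also be assembled programmatically. The sketch below reads the template literally: it assumes the AccountKey is the base64 encoding of the access key and secret key joined by a `|` character, which is an interpretation of `base64_encoded_value_of(privacera_access_key|privacera_secret_key)` and should be confirmed for your deployment. The function names and the example values are hypothetical. A second helper hides the key so the string can be logged safely.

```python
import base64


def build_connect_str(access_key, secret_key, storage_account, host, port):
    """Assemble the connection string in the shape of the template above."""
    # Assumption: AccountKey = base64("<access_key>|<secret_key>").
    token = "{}|{}".format(access_key, secret_key).encode("utf-8")
    account_key = base64.b64encode(token).decode("ascii")
    return (
        "DefaultEndpointsProtocol=https;"
        "AccountName={}-{};"
        "AccountKey={};"
        "BlobEndpoint=https://{}:{};"
    ).format(access_key, storage_account, account_key, host, port)


def redact_connect_str(connect_str):
    """Return the connection string with the AccountKey value hidden."""
    parts = []
    for part in connect_str.split(";"):
        if part.startswith("AccountKey="):
            part = "AccountKey=***"
        parts.append(part)
    return ";".join(parts)


# Made-up values, for illustration only.
example = build_connect_str(
    "example_access", "example_secret", "examplestorage", "dataserver.example", 8282
)
print(redact_connect_str(example))
```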

Define a sample access method to get Azure file and directories

def list_directory_contents(connect_str):
    try:
        initialize_storage_account_connect_str(connect_str)
        
        file_system_client = service_client.get_file_system_client(file_system="{storage_container_name}")
        #sample values as shown below
        #file_system_client = service_client.get_file_system_client(file_system="infraqa-test")

        paths = file_system_client.get_paths(path="{directory_path}")
        #sample values as shown below
        #paths = file_system_client.get_paths(path="file data/data/format=csv/sample/")

        for path in paths:
            print(path.name + '\n')

    except Exception as e:
        print(e)

To verify that the proxy is functioning, call the access methods

connect_str = prepare_connect_str()
list_directory_contents(connect_str)
