Privacera Documentation

Configure EMR with Privacera Platform

This topic shows how to configure EMR with Privacera using Privacera Manager.

Kerberos required for EMR FGAC or OLAC

Note

To support Privacera FGAC or OLAC, the EMR application must be configured with Kerberos.

  1. SSH to the instance as USER.

  2. Run the following commands:

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.emr.yml config/custom-vars/
    vi config/custom-vars/vars.emr.yml
  3. Edit the properties to set the values. For property details and descriptions, see Properties to configure EMR.

  4. If your cluster was running while the external Hive Metastore was down and you cannot connect to it, restart the following three services:

    sudo systemctl restart hive-hcatalog-server
    sudo systemctl restart hive-server2
    sudo systemctl restart presto-server
  5. Run the following commands:

    cd ~/privacera/privacera-manager
    ./privacera-manager.sh update

    After the update finishes, the CloudFormation JSON template files and privacera_cust_conf.zip are available in ~/privacera/privacera-manager/output/emr.

  6. On the AWS instance where Privacera is installed, run the following:

    1. (Optional) Create IAM roles using the emr-roles-creation-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-role-creation --template-body file://emr-roles-creation-template.json --capabilities CAPABILITY_NAMED_IAM

      Note

      This creates IAM roles with minimal permissions. Add bucket permissions to the respective IAM roles as your requirements dictate.

    2. (Optional) Create Security Configurations using the emr-security-config-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-security-config-creation  --template-body file://emr-security-config-template.json
    3. Confirm the privacera_cust_conf.zip file has been copied to the location specified in EMR_CUST_CONF_ZIP_PATH.

    4. Create EMR using the emr-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-creation  --template-body file://emr-template.json

      Note

      If you are upgrading from EMR version 6.3 or lower to EMR version 6.4 or higher to use the Trino plug-in, you must re-create the EMR security configuration from the new template generated by Privacera Manager, because the new security configuration adds the trino user.
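The create-stack commands in step 6 return as soon as CloudFormation accepts the request, so it can help to wait for each stack to finish before moving on to the next one. The following is a minimal sketch, not part of Privacera Manager: the function names `run` and `create_stack_and_wait` and the `DRY_RUN` variable are hypothetical, and the sketch assumes the AWS CLI is installed and that you call it from ~/privacera/privacera-manager/output/emr. It relies only on the standard `cloudformation create-stack`, `wait stack-create-complete`, and `describe-stacks` subcommands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a CloudFormation stack and block until it is ready.
# With DRY_RUN=1 the aws commands are printed instead of executed.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

create_stack_and_wait() {
  local region="$1" stack="$2" template="$3"
  shift 3   # remaining arguments (for example --capabilities ...) are passed through

  run aws --region "$region" cloudformation create-stack \
    --stack-name "$stack" \
    --template-body "file://$template" "$@" || return 1

  # Blocks until the stack reaches CREATE_COMPLETE; returns non-zero if creation fails.
  run aws --region "$region" cloudformation wait stack-create-complete \
    --stack-name "$stack" || return 1

  # Print the final stack status.
  run aws --region "$region" cloudformation describe-stacks \
    --stack-name "$stack" --query 'Stacks[0].StackStatus' --output text
}
```

For example, `DRY_RUN=1 create_stack_and_wait <AWS-REGION> privacera-emr-role-creation emr-roles-creation-template.json --capabilities CAPABILITY_NAMED_IAM` prints the three commands without calling AWS.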

Note

  • For PrestoDB, secrets encryption of the Solr authentication password is not supported. However, the properties file that contains the password is readable only by the presto service user, which limits its exposure.


Properties to configure EMR

| Property | Description | Example |
|---|---|---|
| EMR_ENABLE | Enable EMR template creation. | true |
| EMR_CLUSTER_NAME | Define a unique name for the EMR cluster. | Privacera-EMR |
| EMR_CREATE_SG | Set to true if you don't have existing security groups and want Privacera Manager to add security group creation steps to the EMR CloudFormation (CF) template. | false |
| EMR_MASTER_SG_ID | If EMR_CREATE_SG is false, set this property. Security group ID for the EMR master node group. | sg-xxxxxxx |
| EMR_SLAVE_SG_ID | If EMR_CREATE_SG is false, set this property. Security group ID for the EMR slave node group. | sg-xxxxxxx |
| EMR_SERVICE_ACCESS_SG_ID | If EMR_CREATE_SG is false, set this property. Security group ID for EMR ServiceAccessSecurity. Set this property only if you are creating EMR in a private network. | sg-xxxxxxx |
| EMR_SG_VPC_ID | If EMR_CREATE_SG is true, set this property. ID of the VPC in which you want to create the EMR cluster. | vpc-xxxxxxxxxxx |
| EMR_MASTER_SG_NAME | If EMR_CREATE_SG is true, set this property. Security group name for the EMR master node group. The name is added to emr-template.json. | priv-master-sg |
| EMR_SLAVE_SG_NAME | If EMR_CREATE_SG is true, set this property. Security group name for the EMR slave node group. The name is added to emr-template.json. | priv-slave-sg |
| EMR_SERVICE_ACCESS_SG_NAME | If EMR_CREATE_SG is true, set this property. Security group name for EMR ServiceAccessSecurity. The name is added to emr-template.json. Set this property only if you are creating EMR in a private network. | priv-private-sg |
| EMR_SUBNET_ID | Subnet ID. | |
| EMR_KEYPAIR | An existing EC2 key pair used to SSH into the master node of the cluster. | privacera-test-pair |
| EMR_EC2_MARKET_TYPE | Market type: SPOT or ON_DEMAND. | SPOT |
| EMR_EC2_INSTANCE_TYPE | Instance type, such as m5.xlarge or r5.xlarge. | m5.large |
| EMR_MASTER_NODE_COUNT | Number of master nodes. Set to 3 to enable EMR's Multiple Master Node feature. | 1 |
| EMR_CORE_NODE_COUNT | Number of core nodes, such as 1 or 2. | 1 |
| EMR_VERSION | Version of EMR. | emr-x.xx.x |
| EMR_EC2_DOMAIN | Domain used by the nodes. It depends on the EMR region; for example, ".ec2.internal" is for us-east-1. | .ec2.internal |
| EMR_USE_STS_REGIONAL_ENDPOINTS | Enables or disables regional endpoints for S3 requests. Default value is false. | true |
| EMR_TERMINATION_PROTECT | Enables or disables termination protection. | true |
| EMR_LOGS_PATH | S3 location for storing EMR logs. | s3://privacera-logs-bucket/ |
| EMR_KERBEROS_ENABLE | Set to true to enable Kerberos on EMR. | false |
| EMR_KDC_ADMIN_PASSWORD | If EMR_KERBEROS_ENABLE is true, set this property. The password used within the cluster for the kadmin service. If Multiple Master Node is enabled, set this to your external KDC admin password. | |
| EMR_CROSS_REALM_PASSWORD | If EMR_KERBEROS_ENABLE is true, set this property. The cross-realm trust principal password, which must be identical across realms. | |
| EMR_SECURITY_CONFIG | Name of the security configuration created for EMR. This can be a pre-created configuration, or Privacera Manager can generate a template from which you create it. | |
| EMR_KERB_TICKET_LIFETIME | Set this property if you want Privacera Manager to create the CF template for the security configuration and EMR_KERBEROS_ENABLE is true. The period for which a Kerberos ticket issued by the cluster's KDC is valid. Cluster applications and services auto-renew tickets after they expire. | 24 |
| EMR_KERB_REALM | Set this property if you want Privacera Manager to create the CF template for the security configuration and EMR_KERBEROS_ENABLE is true. The Kerberos realm name for the other realm in the trust relationship. If Multiple Master Node is enabled, set this to your external KDC realm. | |
| EMR_KERB_DOMAIN | Set this property if you want Privacera Manager to create the CF template for the security configuration and EMR_KERBEROS_ENABLE is true. The domain name of the other realm in the trust relationship. If Multiple Master Node is enabled, set this to your external KDC realm. | |
| EMR_KERB_ADMIN_SERVER | Set this property if you want Privacera Manager to create the CF template for the security configuration and EMR_KERBEROS_ENABLE is true. The fully qualified domain name (FQDN) and an optional port for the Kerberos admin server in the other realm. If a port is not specified, 749 is used. If Multiple Master Node is enabled, set this to your external KDC admin server. | |
| EMR_KERB_KDC_SERVER | Set this property if you want Privacera Manager to create the CF template for the security configuration and EMR_KERBEROS_ENABLE is true. The fully qualified domain name (FQDN) and an optional port for the KDC in the other realm. If a port is not specified, 88 is used. If Multiple Master Node is enabled, set this to your external KDC server. | |
| EMR_AWS_ACCT_ID | AWS account ID where the EMR cluster resides. | 9999999 |
| EMR_DEFAULT_ROLE | Default role attached to the EMR cluster for performing cluster-related activities. This should be a pre-created role. | EMR_DefaultRole |
| EMR_ROLE_FOR_CLUSTER_NODES | IAM role attached to each node in the EMR cluster. It should have only the minimal permissions needed to download privacera_cust_conf.zip and for basic EMR capabilities. It can be an existing role; if not, you can use the IAM role CF template to generate it after the Privacera Manager update. | restricted_node_role |
| EMR_USE_SINGLE_ROLE_FOR_APPS | If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Creates a single IAM role that is used by all EMR applications. | true |
| EMR_ROLE_FOR_APPS | If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Name of the IAM role used by all EMR applications. | app_data_access_role |
| EMR_ROLE_FOR_SPARK | If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. To create multiple IAM roles for specific applications, set EMR_USE_SINGLE_ROLE_FOR_APPS to false. Name of the IAM role used by the Spark application (Dataserver) for data access. | spark_data_access_role |
| EMR_ROLE_FOR_HIVE | If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Name of the IAM role used by the Hive application for data access. | hive_data_access_role |
| EMR_ROLE_FOR_PRESTO | If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Name of the IAM role used by the Presto application for data access. | presto_data_access_role |
| EMR_HIVE_METASTORE | Metastore type, for example "glue" or "hive" (for an external Hive metastore). | glue |
| EMR_HIVE_METASTORE_PATH | S3 location for the Hive metastore. | s3://hive-warehouse |
| EMR_HIVE_METASTORE_CONNECTION_URL | If EMR_HIVE_METASTORE is hive, set this property. JDBC connection URL for connecting to Hive. | jdbc:mysql://<jdbc-host>:3306/<hive-db-name>?createDatabaseIfNotExist=true |
| EMR_HIVE_METASTORE_CONNECTION_DRIVER | If EMR_HIVE_METASTORE is hive, set this property. JDBC driver name. | org.mariadb.jdbc.Driver |
| EMR_HIVE_METASTORE_CONNECTION_USERNAME | If EMR_HIVE_METASTORE is hive, set this property. JDBC user name. | hive |
| EMR_HIVE_METASTORE_CONNECTION_PASSWORD | If EMR_HIVE_METASTORE is hive, set this property. JDBC password. | StRong@PassW0rd |
| EMR_HIVE_SERVICE_NAME | Custom Hive service name for the Hive application in EMR. | teamA_policy |
| EMR_TRINO_HIVE_SERVICE_NAME | Custom Hive service name for the Trino application in EMR. | teamB_policy |
| EMR_SPARK_HIVE_SERVICE_NAME | Custom Hive access service name for Spark applications in EMR. | teamC_policy |
| EMR_APP_SPARK_OLAC_ENABLE | Set to true to install the Spark application with the Privacera plugin for Object Level Access Control (OLAC). Note: Recommended when complete access control on the objects in AWS S3 is required. When the property is set to true, the s3 and s3n protocols are not supported on EMR clusters while running Spark queries. | true |
| EMR_APP_SPARK_FGAC_ENABLE | Set to true to install the Spark application with the Privacera plugin for Fine Grained Access Control (FGAC) on tables and columns. Note: Recommended for compliance purposes, since the whole cluster will still have direct access to AWS S3 data. | false |
| EMR_APP_PRESTO_DB_ENABLE | Set to true to install the PrestoDB application with the Privacera plugin. PrestoDB and Trino are mutually exclusive; enable only one at a time. | false |
| EMR_APP_PRESTO_SQL_ENABLE | Set to true to install the Trino application with the Privacera plugin. PrestoDB and Trino are mutually exclusive; enable only one at a time. Note: Trino is supported for EMR versions 6.1.0 and higher. Note: If the EMR version is 6.4.0, setting this flag installs the Trino plugin. | false |
| EMR_APP_HIVE_ENABLE | Set to true to install the Hive application with the Privacera plugin. | true |
| EMR_APP_ZEPPELIN_ENABLE | Set to true to install the Zeppelin application. | true |
| EMR_APP_LIVY_ENABLE | Set to true to install the Livy application. | true |
| EMR_CUST_CONF_ZIP_PATH | Path where the privacera_cust_conf.zip file will be placed. Privacera Manager generates privacera_cust_conf.zip in the ~/privacera/privacera-manager/output/emr folder. The file must be placed at an S3 or HTTPS location from which the EMR cluster can download it. | s3://privacera-artifacts/ |
| EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL | Set to true to enable view-level column masking and row filtering for SparkSQL. The property can be used only when EMR_APP_SPARK_FGAC_ENABLE is true. | false |
| EMR_RANGER_IS_FALLBACK_SUPPORTED | Enables or disables the fallback behavior to the privacera_files and privacera_hive services. It determines whether access to the resource files is allowed or denied for the user. To enable the fallback, set to true; to disable, set to false. | true |
| EMR_SPARK_DELTA_LAKE_ENABLE | Set to true to enable Delta Lake on EMR Spark. | true |
| EMR_SPARK_DELTA_LAKE_CORE_JAR_DOWNLOAD_URL | Download URL of the Delta Lake core JAR. The Delta Lake core JAR depends on the Spark version, so you must find the appropriate version for your EMR. See Delta Lake compatibility with Apache Spark. Set this property to the download URL of the appropriate Delta Lake core JAR. See the Maven release page for Delta Core. For example, for Spark version 3.1.x, the download URL is https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar. | https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.0.1/delta-core_2.12-1.0.1.jar |
| EMR_SPARK_DELTA_LAKE_STORAGE_JAR_DOWNLOAD_URL | If you are using EMR version 6.10.0 or higher, you must also set this property. Download URL of the Delta Lake storage JAR. The Delta Lake storage JAR depends on the Spark version, so you must find the appropriate version for your EMR. See Delta Lake compatibility with Apache Spark. Set this property to the download URL of the appropriate Delta Lake storage JAR. See the Maven release page for Delta Storage. For example, for Spark version 3.1.0 and above, the download URL is https://repo1.maven.org/maven2/io/delta/delta-storage/2.1.1/delta-storage-2.1.1.jar. | https://repo1.maven.org/maven2/io/delta/delta-storage/2.2.0/delta-storage-2.2.0.jar |
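Several of the properties above depend on one another, so a quick consistency check of vars.emr.yml before running the Privacera Manager update may catch mistakes early. The following is a minimal sketch, not part of Privacera Manager: the function names `get_var` and `check_emr_vars` are hypothetical, and it assumes each property appears on its own line as `KEY: "value"` (or `KEY: value`) with no trailing comment. It checks only rules stated in the property descriptions.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for config/custom-vars/vars.emr.yml.
# Assumes one "KEY: value" pair per line; commented-out keys count as unset.

# Print the value of a key (quotes and trailing spaces stripped); empty if unset.
get_var() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" | head -n 1 | tr -d "\"'" | sed 's/[[:space:]]*$//'
}

check_emr_vars() {
  local file="$1" errors=0 key

  # PrestoDB and Trino are mutually exclusive.
  if [ "$(get_var "$file" EMR_APP_PRESTO_DB_ENABLE)" = "true" ] &&
     [ "$(get_var "$file" EMR_APP_PRESTO_SQL_ENABLE)" = "true" ]; then
    echo "ERROR: PrestoDB and Trino are mutually exclusive; enable only one."
    errors=$((errors + 1))
  fi

  # View-level access control is usable only together with Spark FGAC.
  if [ "$(get_var "$file" EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL)" = "true" ] &&
     [ "$(get_var "$file" EMR_APP_SPARK_FGAC_ENABLE)" != "true" ]; then
    echo "ERROR: view-level access control requires EMR_APP_SPARK_FGAC_ENABLE to be true."
    errors=$((errors + 1))
  fi

  # Existing security groups must be named when Privacera Manager does not create them.
  if [ "$(get_var "$file" EMR_CREATE_SG)" = "false" ]; then
    for key in EMR_MASTER_SG_ID EMR_SLAVE_SG_ID; do
      if [ -z "$(get_var "$file" "$key")" ]; then
        echo "ERROR: $key must be set when EMR_CREATE_SG is false."
        errors=$((errors + 1))
      fi
    done
  fi

  # FGAC and OLAC need Kerberos; only warn, since a pre-created
  # security configuration may already provide it.
  if { [ "$(get_var "$file" EMR_APP_SPARK_OLAC_ENABLE)" = "true" ] ||
       [ "$(get_var "$file" EMR_APP_SPARK_FGAC_ENABLE)" = "true" ]; } &&
     [ "$(get_var "$file" EMR_KERBEROS_ENABLE)" != "true" ]; then
    echo "WARNING: FGAC or OLAC is enabled but EMR_KERBEROS_ENABLE is not true."
  fi

  [ "$errors" -eq 0 ]
}
```

Calling `check_emr_vars config/custom-vars/vars.emr.yml` from ~/privacera/privacera-manager prints any problems found and returns a non-zero status if there are errors.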

If EMR's Multiple Master Node feature is enabled, the following properties must be configured:

| Property | Description | Example |
|---|---|---|
| EMR_HUE_DB_ENGINE | Set this property for the external database engine that is used to create Hue's external database. | mysql |
| EMR_HUE_DB_HOST | Set this property for the external database host address. | |
| EMR_HUE_DB_PORT | Set this property for the external database port. | 3306 |
| EMR_HUE_DB_USER | Set this property for the database user name to access Hue's external database. | |
| EMR_HUE_DB_PASS | Set the database user password to access Hue's external database. | |
| EMR_HUE_DB_NAME | Set this property for Hue's external database name. This database must already exist. | |
| EMR_OOZIE_JDBC_DRIVER | Set this property for the external JDBC driver. | org.mariadb.jdbc.Driver |
| EMR_OOZIE_JDBC_URL | Set the JDBC URL to connect to the external database. | jdbc:mysql://10.211.245.255:3306/oozie_db?createDatabaseIfNotExist=true |
| EMR_OOZIE_JDBC_USER | Set this property for the database user name to access Oozie's external database. | |
| EMR_OOZIE_JDBC_PASS | Set the database user password to access Oozie's external database. | |
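Because the Hue and Oozie properties are required only when the Multiple Master Node feature is enabled, they are easy to overlook. The following is a minimal sketch of a check for them, not part of Privacera Manager: the function name `check_multi_master_vars` is hypothetical, and it assumes each property in vars.emr.yml appears on its own line as `KEY: "value"` (or `KEY: value`).

```shell
#!/usr/bin/env bash
# Hypothetical check: when EMR_MASTER_NODE_COUNT is 3, every Hue and Oozie
# external-database property must have a non-empty value.

check_multi_master_vars() {
  local file="$1" missing=0 key

  # Nothing to verify unless three master nodes are requested.
  if ! grep -Eq "^EMR_MASTER_NODE_COUNT:[[:space:]]*[\"']?3[\"']?[[:space:]]*\$" "$file"; then
    return 0
  fi

  for key in EMR_HUE_DB_ENGINE EMR_HUE_DB_HOST EMR_HUE_DB_PORT \
             EMR_HUE_DB_USER EMR_HUE_DB_PASS EMR_HUE_DB_NAME \
             EMR_OOZIE_JDBC_DRIVER EMR_OOZIE_JDBC_URL \
             EMR_OOZIE_JDBC_USER EMR_OOZIE_JDBC_PASS; do
    # A key counts as set when it is followed by at least one value character.
    if ! grep -Eq "^${key}:[[:space:]]*[\"']?[^[:space:]\"'#]" "$file"; then
      echo "MISSING: $key"
      missing=$((missing + 1))
    fi
  done

  [ "$missing" -eq 0 ]
}
```

Running `check_multi_master_vars config/custom-vars/vars.emr.yml` lists each unset property on its own line and returns a non-zero status if any are missing.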