AWS EMR

This topic shows how to configure AWS EMR with Privacera using Privacera Manager.

Configuration

  1. SSH to the instance as USER.

  2. Run the following commands.

    cd ~/privacera/privacera-manager
    cp config/sample-vars/vars.emr.yml config/custom-vars/
    vi config/custom-vars/vars.emr.yml
    
  3. Edit the following properties.

    Property Description Example
    EMR_ENABLE Enable EMR template creation. true
    EMR_CLUSTER_NAME Define a unique name for the EMR cluster. Privacera-EMR
    EMR_CREATE_SG Set this to true if you don't have existing security groups and want Privacera Manager to take care of adding security group creation steps in the EMR CF template. false
    EMR_MASTER_SG_ID If EMR_CREATE_SG is false, set this property. Security Group ID for EMR Master Node Group. sg-xxxxxxx
    EMR_SLAVE_SG_ID If EMR_CREATE_SG is false, set this property. Security Group ID for EMR Slave Node Group. sg-xxxxxxx
    EMR_SERVICE_ACCESS_SG_ID If EMR_CREATE_SG is false, set this property. Security Group ID for EMR ServiceAccessSecurity. Fill this property only if you are creating EMR in a Private Network. sg-xxxxxxx
    EMR_SG_VPC_ID If EMR_CREATE_SG is true, set this property. VPC ID in which you want to create the EMR Cluster. vpc-xxxxxxxxxxx
    EMR_MASTER_SG_NAME If EMR_CREATE_SG is true, set this property. Security Group Name for EMR Master Node Group. The security group name will be added to the emr-template.json. priv-master-sg
    EMR_SLAVE_SG_NAME If EMR_CREATE_SG is true, set this property. Security Group Name for EMR Slave Node Group. The security group name will be added to the emr-template.json. priv-slave-sg
    EMR_SERVICE_ACCESS_SG_NAME If EMR_CREATE_SG is true, set this property. Security Group Name for EMR ServiceAccessSecurity. The security group name will be added to the emr-template.json. Fill this property only if you are creating EMR in a Private Network. priv-private-sg
    EMR_SUBNET_ID ID of the subnet in which to create the EMR cluster.
    EMR_KEYPAIR An existing EC2 key pair to SSH into the master node of the cluster. privacera-test-pair
    EMR_EC2_MARKET_TYPE Set market type as SPOT or ON_DEMAND. SPOT
    EMR_EC2_INSTANCE_TYPE Set the instance type. Instances can be of different types such as m5.xlarge, r5.xlarge and so on. m5.large
    EMR_MASTER_NODE_COUNT Node count for Master. The number of nodes can be 1, 2 and so on. 1
    EMR_CORE_NODE_COUNT Node count for Core. The number of nodes can be 1, 2 and so on. 1
    EMR_VERSION Version of EMR. emr-x.xx.x
    EMR_EC2_DOMAIN Domain used by the nodes. It depends on EMR Region, for example, ".ec2.internal" is for us-east-1. .ec2.internal
    EMR_TERMINATION_PROTECT Set to enable/disable termination protection. true
    EMR_LOGS_PATH S3 location for storing EMR logs. s3://privacera-logs-bucket/
    EMR_KERBEROS_ENABLE Set to true if you want to enable kerberization on EMR. false
    EMR_KDC_ADMIN_PASSWORD If EMR_KERBEROS_ENABLE is true, set this property. The password used within the cluster for the kadmin service.  
    EMR_CROSS_REALM_PASSWORD If EMR_KERBEROS_ENABLE is true, set this property. The cross-realm trust principal password, which must be identical across realms.  
    EMR_SECURITY_CONFIG Name of the Security Configurations created for EMR. This can be a pre-created configuration, or Privacera Manager can generate a template through which you can create this configuration.  
    EMR_KERB_TICKET_LIFETIME Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The period for which a Kerberos ticket issued by the cluster’s KDC is valid. Cluster applications and services auto-renew tickets after they expire. EMR_KERB_TICKET_LIFETIME: 24
    EMR_KERB_REALM Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The Kerberos realm name for the other realm in the trust relationship.  
    EMR_KERB_DOMAIN Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The domain name of the other realm in the trust relationship.  
    EMR_KERB_ADMIN_SERVER Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The fully qualified domain name (FQDN) and an optional port for the Kerberos admin server in the other realm. If a port is not specified, 749 is used.  
    EMR_KERB_KDC_SERVER Set this property if you want Privacera Manager to create CF template for creating security configuration and EMR_KERBEROS_ENABLE is true. The fully qualified domain name (FQDN) and an optional port for the KDC in the other realm. If a port is not specified, 88 is used.  
    EMR_AWS_ACCT_ID AWS Account ID where EMR Cluster resides 9999999
    EMR_DEFAULT_ROLE Default role attached to EMR Cluster for performing cluster-related activities. This should be a pre-created role. EMR_DefaultRole
    EMR_ROLE_FOR_CLUSTER_NODES The IAM role that will be attached to each node in the EMR cluster. It should have only the minimal permissions needed to download privacera_cust_conf.zip and provide basic EMR capabilities. It can be an existing role; if you do not have one, you can use the IAM role CF template to generate it after the Privacera Manager update. restricted_node_role
    EMR_USE_SINGLE_ROLE_FOR_APPS If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Create a Single IAM Role that will be used by All EMR Applications. true
    EMR_ROLE_FOR_APPS If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by all EMR Apps app_data_access_role
    EMR_ROLE_FOR_SPARK If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. Create multiple IAM Roles to be used by specific applications. Set EMR_USE_SINGLE_ROLE_FOR_APPS to be false. IAM Role name which will be used by Spark Application (Dataserver) for data access. spark_data_access_role
    EMR_ROLE_FOR_HIVE If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by Hive Application for data access. hive_data_access_role
    EMR_ROLE_FOR_PRESTO If you want Privacera Manager to generate a CF template for IAM roles configuration, set this property. IAM Role name which will be used by Presto Application for data access. presto_data_access_role
    EMR_HIVE_METASTORE Metastore type: "glue", or "hive" for an external Hive metastore. glue
    EMR_HIVE_METASTORE_PATH S3 location for hive metastore s3://hive-warehouse
    EMR_HIVE_METASTORE_CONNECTION_URL If EMR_HIVE_METASTORE is hive, set this property. JDBC Connection URL for connecting to hive. jdbc:mysql://<jdbc-host>:3306/<hive-db-name>?createDatabaseIfNotExist=true
    EMR_HIVE_METASTORE_CONNECTION_DRIVER If EMR_HIVE_METASTORE is hive, set this property. JDBC Driver Name org.mariadb.jdbc.Driver
    EMR_HIVE_METASTORE_CONNECTION_USERNAME If EMR_HIVE_METASTORE is hive, set this property. JDBC UserName hive
    EMR_HIVE_METASTORE_CONNECTION_PASSWORD If EMR_HIVE_METASTORE is hive, set this property. JDBC Password StRong@PassWord
    EMR_APP_SPARK_OLAC_ENABLE

    To install the Spark application with the Privacera plugin, set the property to true. OLAC stands for Object Level Access Control.

    Note:

    • Recommended when complete access control on the objects in AWS S3 is required.
    • When the property is set to true, s3 and s3n protocols will not be supported on EMR clusters while running Spark queries.

    true
    EMR_APP_SPARK_FGAC_ENABLE

    To install the Spark application with the Privacera plugin, set the property to true. FGAC stands for Fine Grained Access Control for tables and columns.

    Note: Recommended mainly for compliance purposes, since the whole cluster will still have direct access to AWS S3 data.

    false
    EMR_APP_PRESTO_DB_ENABLE

    To install PrestoDB application with Privacera plugin, set the property to true.

    PrestoDB and PrestoSQL are mutually exclusive. Only one should be enabled at a time.

    true
    EMR_APP_PRESTO_SQL_ENABLE

    To install PrestoSQL application with Privacera plugin, set the property to true.

    PrestoDB and PrestoSQL are mutually exclusive. Only one should be enabled at a time.

    Note: PrestoSQL is supported for EMR versions 6.1.0 and higher.

    false
    EMR_APP_HIVE_ENABLE To install Hive application with Privacera plugin, set the property to true. true
    EMR_APP_ZEPPELIN_ENABLE To install Zeppelin application, set the property to true. true
    EMR_APP_LIVY_ENABLE To install Livy application, set the property to true. true
    EMR_CUST_CONF_ZIP_PATH Location where the privacera_cust_conf.zip file will be placed. Privacera Manager generates privacera_cust_conf.zip under the ~/privacera/privacera-manager/output/emr folder. Place this file at an S3 or HTTPS location from which the EMR cluster can download it. s3://privacera-artifacts/
    EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL

    Set the property to true to enable view-level column masking and row filter for SparkSQL. The property can be used only when you set EMR_APP_SPARK_FGAC_ENABLE to true.

    To learn how to use view-level access control in Spark, click here.

    false
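
Several properties in the table above are only needed when another property has a particular value (for example, the security group IDs when EMR_CREATE_SG is false). The following is a minimal sketch of a pre-update check, not part of Privacera Manager. It assumes vars.emr.yml contains flat `KEY: "value"` lines without trailing comments; the helper names get_var, require_var, and check_emr_vars are hypothetical and introduced here only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: check the conditional rules from the property table.
# Assumption: the vars file holds flat `KEY: "value"` lines.

# Print the value of a top-level key, without surrounding quotes.
get_var() {
  local file="$1" key="$2"
  sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1 \
    | sed -e 's/[[:space:]]*$//' -e 's/^"\(.*\)"$/\1/' -e "s/^'\(.*\)'$/\1/"
}

# Print a message and return 1 if the key is missing or empty.
require_var() {
  local file="$1" key="$2" reason="$3"
  if [ -z "$(get_var "$file" "$key")" ]; then
    echo "MISSING: $key ($reason)"
    return 1
  fi
}

check_emr_vars() {
  local file="$1" rc=0 key
  # An unset EMR_CREATE_SG is treated like false here (an assumption).
  if [ "$(get_var "$file" EMR_CREATE_SG)" = "true" ]; then
    for key in EMR_SG_VPC_ID EMR_MASTER_SG_NAME EMR_SLAVE_SG_NAME; do
      require_var "$file" "$key" "EMR_CREATE_SG is true" || rc=1
    done
  else
    for key in EMR_MASTER_SG_ID EMR_SLAVE_SG_ID; do
      require_var "$file" "$key" "EMR_CREATE_SG is false" || rc=1
    done
  fi
  if [ "$(get_var "$file" EMR_KERBEROS_ENABLE)" = "true" ]; then
    for key in EMR_KDC_ADMIN_PASSWORD EMR_CROSS_REALM_PASSWORD; do
      require_var "$file" "$key" "EMR_KERBEROS_ENABLE is true" || rc=1
    done
  fi
  if [ "$(get_var "$file" EMR_HIVE_METASTORE)" = "hive" ]; then
    for key in EMR_HIVE_METASTORE_CONNECTION_URL EMR_HIVE_METASTORE_CONNECTION_DRIVER \
               EMR_HIVE_METASTORE_CONNECTION_USERNAME EMR_HIVE_METASTORE_CONNECTION_PASSWORD; do
      require_var "$file" "$key" "EMR_HIVE_METASTORE is hive" || rc=1
    done
  fi
  # PrestoDB and PrestoSQL are mutually exclusive.
  if [ "$(get_var "$file" EMR_APP_PRESTO_DB_ENABLE)" = "true" ] &&
     [ "$(get_var "$file" EMR_APP_PRESTO_SQL_ENABLE)" = "true" ]; then
    echo "CONFLICT: enable only one of EMR_APP_PRESTO_DB_ENABLE and EMR_APP_PRESTO_SQL_ENABLE"
    rc=1
  fi
  # View-level access control needs Spark FGAC.
  if [ "$(get_var "$file" EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL)" = "true" ] &&
     [ "$(get_var "$file" EMR_APP_SPARK_FGAC_ENABLE)" != "true" ]; then
    echo "CONFLICT: EMR_SPARK_ENABLE_VIEW_LEVEL_ACCESS_CONTROL requires EMR_APP_SPARK_FGAC_ENABLE"
    rc=1
  fi
  return $rc
}

# Example: check_emr_vars ~/privacera/privacera-manager/config/custom-vars/vars.emr.yml
```

The function prints one line per problem and returns a non-zero status if any rule is violated, so it can gate the update step in a script. The private-network security group properties are left out because the table does not give a property that says whether the network is private.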

    Note

    If your cluster was running while the external Hive metastore was down, and you are unable to connect to it, then restart the following three services:

    sudo systemctl restart hive-hcatalog-server
    sudo systemctl restart hive-server2
    sudo systemctl restart presto-server
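
The three restarts in the note above can also be run as a loop that reports whether each service came back. This is a minimal sketch under assumptions: the function name restart_metastore_clients and the RUNNER variable are hypothetical, introduced only so the commands can be previewed without running them.

```shell
#!/usr/bin/env bash
# Services named in the note above.
HIVE_METASTORE_DEPENDENTS="hive-hcatalog-server hive-server2 presto-server"

# Restart each service, then ask systemd whether it is active.
# RUNNER defaults to sudo; set RUNNER=echo to only print the commands.
restart_metastore_clients() {
  local runner="${RUNNER:-sudo}" svc rc=0
  for svc in $HIVE_METASTORE_DEPENDENTS; do
    if ! $runner systemctl restart "$svc"; then
      echo "restart failed: $svc"
      rc=1
      continue
    fi
    if ! $runner systemctl is-active --quiet "$svc"; then
      echo "not active after restart: $svc"
      rc=1
    fi
  done
  return $rc
}

# Example (dry run): RUNNER=echo restart_metastore_clients
```

On a cluster that does not run Presto, the presto-server entry would need to be removed from the list.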
    
  4. Run the following commands.

    cd ~/privacera/privacera-manager 
    ./privacera-manager.sh update
    

    After the update finishes, all the CloudFormation JSON template files and privacera_cust_conf.zip are available at ~/privacera/privacera-manager/output/emr.
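
Before moving on, it can help to confirm that the generated files are actually present. The sketch below checks for the file names used in this topic. It assumes that emr-template.json and privacera_cust_conf.zip are always generated and that the two templates used by the optional steps may be absent in some configurations; that split, and the function name check_emr_output, are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# File names taken from the steps in this topic.
REQUIRED_EMR_OUTPUT="emr-template.json privacera_cust_conf.zip"
OPTIONAL_EMR_OUTPUT="emr-roles-creation-template.json emr-security-config-template.json"

# Check that each expected file exists and is non-empty in the output folder.
check_emr_output() {
  local dir="${1:-$HOME/privacera/privacera-manager/output/emr}" f rc=0
  for f in $REQUIRED_EMR_OUTPUT; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $f"
    else
      echo "missing or empty: $f"
      rc=1
    fi
  done
  for f in $OPTIONAL_EMR_OUTPUT; do
    if [ -s "$dir/$f" ]; then
      echo "ok: $f"
    else
      echo "optional, not found: $f"
    fi
  done
  return $rc
}

# Example: check_emr_output
```

Called without an argument, the function looks in the default output folder named above; a different folder can be passed as the first argument.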

  5. Run the following steps on the AWS instance where Privacera is installed.

    1. (Optional) Create IAM roles using the emr-roles-creation-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-role-creation --template-body file://emr-roles-creation-template.json --capabilities CAPABILITY_NAMED_IAM
      

      Note

      This creates IAM roles with minimal permissions. You can add bucket permissions to the respective IAM roles as needed.

    2. (Optional) Create Security Configurations using the emr-security-config-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-security-config-creation --template-body file://emr-security-config-template.json
      
    3. Copy the privacera_cust_conf.zip to the location specified in EMR_CUST_CONF_ZIP_PATH.

    4. Create EMR using the emr-template.json template. Run the following command.

      aws --region <AWS-REGION> cloudformation create-stack --stack-name privacera-emr-creation --template-body file://emr-template.json
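
The sub-steps above can be chained in one script so that each stack finishes creating before the next command runs. This is a sketch under assumptions, not a Privacera-provided script: it assumes EMR_CUST_CONF_ZIP_PATH is an S3 location (so `aws s3 cp` applies), that it is run from ~/privacera/privacera-manager/output/emr, and that waiting with `aws cloudformation wait stack-create-complete` between stacks is acceptable. The function names and the DRY_RUN switch are hypothetical.

```shell
#!/usr/bin/env bash
# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create a CloudFormation stack and block until creation completes.
# Extra arguments (for example --capabilities) are passed to create-stack.
create_stack_and_wait() {
  local region="$1" stack="$2" template="$3"
  shift 3
  run aws --region "$region" cloudformation create-stack \
    --stack-name "$stack" --template-body "file://$template" "$@" || return 1
  run aws --region "$region" cloudformation wait stack-create-complete \
    --stack-name "$stack"
}

# deploy_emr <region> <s3-destination> [yes|no]
# The third argument selects whether the two optional stacks are created.
deploy_emr() {
  local region="$1" zip_dest="$2" with_optional="${3:-no}"
  if [ "$with_optional" = "yes" ]; then
    create_stack_and_wait "$region" privacera-emr-role-creation \
      emr-roles-creation-template.json --capabilities CAPABILITY_NAMED_IAM || return 1
    create_stack_and_wait "$region" privacera-emr-security-config-creation \
      emr-security-config-template.json || return 1
  fi
  # Upload the configuration archive before the cluster is created.
  run aws s3 cp privacera_cust_conf.zip "$zip_dest" || return 1
  create_stack_and_wait "$region" privacera-emr-creation emr-template.json
}

# Example (prints the commands only):
#   DRY_RUN=1 deploy_emr us-east-1 s3://privacera-artifacts/ yes
```

Running with DRY_RUN=1 first makes it easy to compare the printed commands against the ones listed in the sub-steps before anything is created.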
      

Note

  • For PrestoDB, secrets encryption of the Solr authentication password is not supported. However, the properties file where the password resides is accessible only to the presto service user, which limits its exposure.


Last update: August 27, 2021