Classifications using random sampling on PrivaceraCloud

By default, PrivaceraCloud scans a database at "shallow depth". That is, for performance, the system examines only a limited portion of the records to derive the classifications.

This default assumes that the database is uniform, with normalized data, so that the examined records accurately represent the rest.

However, with some unnormalized databases, this uniformity might be lacking.

If you suspect that your database values are not uniform, you can configure PrivaceraCloud to take a random sample from the entire database for analysis in classification.

One purpose of random sampling is to help you isolate these data variations so that you can eliminate them.
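The difference between the two approaches can be illustrated with a minimal Python sketch. This is not PrivaceraCloud code: it assumes, purely for illustration, that a shallow scan reads only the first rows of a table, and the helper names (`build_column`, `shallow_sample`, `random_sample`) and the sample data are hypothetical.

```python
import random

def build_column(n_rows=1000, clean_rows=800):
    """Hypothetical column: well-formed addresses first, bare usernames last."""
    clean = [f"first{i}.last{i}@example.com" for i in range(clean_rows)]
    messy = [f"user{i}" for i in range(n_rows - clean_rows)]
    return clean + messy

def fraction_with_at_sign(values):
    """Share of values that contain an @-sign."""
    return sum("@" in v for v in values) / len(values)

def shallow_sample(values, k):
    """Assumed shallow scan: look only at the first k rows."""
    return values[:k]

def random_sample(values, k, seed=None):
    """Pick k rows at random (without replacement) from the whole column."""
    return random.Random(seed).sample(values, k)

column = build_column()
head = shallow_sample(column, 100)
rand = random_sample(column, 100, seed=7)

# The first 100 rows are all well-formed, so the irregular rows stay hidden.
print("first rows :", fraction_with_at_sign(head))
# A random sample can reach the irregular rows at the end of the column.
print("random rows:", fraction_with_at_sign(rand))
print("whole table:", fraction_with_at_sign(column))
```

In this sketch the first rows look perfectly uniform, while a sample drawn from the whole column has a chance of surfacing the bare usernames that a look at the first rows alone would miss.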

Supported JDBC applications for random sampling

Random sampling is supported for the following applications:

  • MySQL

  • Oracle

  • Trino

If you configure random sampling for any other database, the setting is ignored.

Prerequisites for random sampling

  • Know the names of the applications you want to randomly sample.

  • Be sure to have the JDBC connection details for those applications.

  • To minimize performance impact, determine whether your database can be considered "large". By default, PrivaceraCloud considers any database with 10,000 or more records to be large. For a large database, random sampling is based on a subset of the data.
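One way to check the last prerequisite is to count the records before you enable sampling. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for your real MySQL, Oracle, or Trino connection. The `customers` table, the helper names, and the choice to count rows per table are assumptions for illustration; only the 10,000-record default comes from this page.

```python
import sqlite3

# Default "large" threshold described in this document.
LARGE_THRESHOLD = 10_000

def count_rows(conn, table):
    """Return the number of rows in a table (pass trusted table names only)."""
    cur = conn.execute(f'SELECT COUNT(*) FROM "{table}"')
    return cur.fetchone()[0]

def is_large(row_count, threshold=LARGE_THRESHOLD):
    """True if the row count meets or exceeds the threshold."""
    return row_count >= threshold

# In-memory database standing in for a real JDBC datasource.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO customers (email) VALUES (?)",
    ((f"user{i}@example.com",) for i in range(12_000)),
)

rows = count_rows(conn, "customers")
print(rows, is_large(rows))  # 12000 True
```

Against a real datasource, the same `SELECT COUNT(*)` query can be run from any SQL client; the Python wrapper is only there to make the sketch self-contained.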

Define datasource (application) and configure random sampling

Random sampling is part of configuring a datasource. For details on setup, see Applications.

Enable random sampling

To enable random sampling for a database:

  1. Go to Settings > Applications.

  2. Under Connected Applications, click the name of the application.

  3. On the BASIC tab, click the jdbc.random.record.fetching toggle.

  4. If your database has 10,000 or more records, specify its approximate number of records in the rows.as.small.dataset field.

Effects of random sampling

Random sampling has some visible effects.

Performance impact

Scans that use random sampling might take noticeably longer to run. The run time can increase with the size of the sample.

Variations in classifications

Do not expect identical classification results for the same database from one random sample to the next.

Each random sample is unique: it operates on a randomly selected subset of the data. Depending on which values happen to be selected, the results of classification can vary from sample to sample.

For example, suppose an EMAIL column does not have consistent values:

  1. Sometimes the value contains a delimiter that separates a first and a last name, followed by an @-sign and an Internet domain.

    A random sample made up of such records can result in a consistent classification as PERSON NAME.

  2. Sometimes the value is a bare username, with no delimited last name and no @-sign at all.

    A sample that mixes in such records is inconsistent, which makes a concrete classification difficult to derive.

Because each run draws its own unique random sample, this inherent inconsistency in the column values can result in different classifications from run to run.
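The run-to-run variation can be reproduced with a small Python simulation. This is a sketch under stated assumptions, not PrivaceraCloud's classifier: `toy_classify` is a hypothetical rule that assigns PERSON NAME only when a large enough share of the sampled values have the delimited-name-plus-domain form, and the UNDETERMINED label, the threshold, and the sample data are invented for illustration.

```python
import random
from collections import Counter

def looks_like_full_address(value):
    """True for values shaped like first.last@domain."""
    local, sep, domain = value.partition("@")
    return bool(sep) and "." in local and bool(domain)

def toy_classify(values, threshold=0.7):
    """Toy rule (hypothetical): tag the sample if enough values match."""
    share = sum(looks_like_full_address(v) for v in values) / len(values)
    return "PERSON NAME" if share >= threshold else "UNDETERMINED"

def run_samples(column, sample_size, runs, seed):
    """Classify `runs` independent random samples and tally the labels."""
    rng = random.Random(seed)
    return Counter(
        toy_classify(rng.sample(column, sample_size)) for _ in range(runs)
    )

# Inconsistent EMAIL column: 70% full addresses, 30% bare usernames.
mixed = [f"first{i}.last{i}@example.com" for i in range(700)]
mixed += [f"user{i}" for i in range(300)]

# Uniform column: every value has the same shape.
uniform = [f"first{i}.last{i}@example.com" for i in range(1000)]

tally = run_samples(mixed, sample_size=20, runs=25, seed=42)
print("mixed column  :", dict(tally))
print("uniform column:", dict(run_samples(uniform, 20, 25, seed=42)))
```

With the uniform column, every sample yields the same label. With the mixed column, the share of well-formed values in each sample hovers around the threshold, so the label depends on which records each sample happens to draw.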