Random Sample Classification

By default, PrivaceraCloud scans a database at "shallow depth". That is, for performance, the system examines only the first 500 records of the database to derive the classifications.

This default assumes that the database is uniform, with normalized data that the first 500 records accurately represent.

However, with some unnormalized databases, this uniformity might be lacking.

If you suspect that your database values are not uniform, you can configure PrivaceraCloud to take a random sample from the entire database for its classification analysis.

One purpose of random sampling is to help you isolate these data variations so that you can eliminate them.
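The difference between the default shallow scan and a random sample can be sketched in a few lines of Python. This is only an illustration of the idea, not PrivaceraCloud's implementation: the function names, the in-memory list standing in for a database column, and the sample values are all assumptions.

```python
import random

SHALLOW_LIMIT = 500  # the default "first 500 records" described above


def shallow_scan(records, limit=SHALLOW_LIMIT):
    """Return only the first `limit` records (hypothetical helper)."""
    return records[:limit]


def random_scan(records, limit=SHALLOW_LIMIT, seed=None):
    """Return `limit` records drawn at random from all records (hypothetical helper)."""
    if len(records) <= limit:
        return list(records)
    return random.Random(seed).sample(records, limit)


# Made-up, non-uniform column: the first 500 rows hold email addresses,
# and the later rows hold phone numbers.
records = ["user%d@example.com" % i for i in range(500)]
records += ["555-01%02d" % i for i in range(100)]

shallow = shallow_scan(records)
sampled = random_scan(records, seed=1)

# The shallow scan sees only email addresses; the random sample also
# reaches the phone numbers stored after row 500.
print(all("@" in r for r in shallow))
print(any("@" not in r for r in sampled))
```

In this sketch the shallow scan never reaches the rows that break the pattern, which is the situation random sampling is meant to expose.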

Supported JDBC Databases

Random sampling is supported for the following databases:

  • MySQL
  • Oracle
  • Trino

If you configure random sampling for any other database, it is ignored.
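For orientation, each of the supported databases has its own SQL idiom for drawing random rows. The sketch below lists one common idiom per database and returns nothing for any other database type, which mirrors the "ignored" behavior described above. The queries PrivaceraCloud actually issues are not documented here, so the templates, the `build_sample_query` helper, and the table name are assumptions for illustration only.

```python
# One well-known random-row idiom per supported database (illustrative only).
SAMPLE_QUERIES = {
    "mysql": "SELECT * FROM {table} ORDER BY RAND() LIMIT {n}",
    "oracle": (
        "SELECT * FROM (SELECT * FROM {table} ORDER BY DBMS_RANDOM.VALUE) "
        "WHERE ROWNUM <= {n}"
    ),
    "trino": "SELECT * FROM {table} ORDER BY random() LIMIT {n}",
}


def build_sample_query(db_type, table, n=500):
    """Return a random-sample query, or None for an unsupported database.

    Hypothetical helper: `table` is interpolated directly, so pass only
    trusted table names.
    """
    template = SAMPLE_QUERIES.get(db_type.lower())
    if template is None:
        return None  # unsupported database: the sampling request is ignored
    return template.format(table=table, n=n)


print(build_sample_query("MySQL", "customers"))
print(build_sample_query("postgresql", "customers"))
```

Ordering an entire table by a random value is expensive on large tables, which is one reason sampling can add a noticeable delay.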

Prerequisite/Setup

  • Know the names of the databases you want to randomly sample.
  • Be sure to have the JDBC connection details for those databases.
  • To minimize performance impact, determine if your database can be considered "large". By default, PrivaceraCloud considers any database with 10,000 records or more to be large. In this case, the random sampling is based on a subset of the data.
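One way to check whether a table counts as "large" is to compare its row count with the 10,000-record default. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for a JDBC database; the `is_large` helper and the `customers` table are assumptions, not part of PrivaceraCloud.

```python
import sqlite3

LARGE_THRESHOLD = 10_000  # default stated above: 10,000 records or more is "large"


def is_large(conn, table, threshold=LARGE_THRESHOLD):
    """Return True if `table` has at least `threshold` rows (hypothetical helper).

    The table name is interpolated directly, so pass only trusted names.
    """
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count >= threshold


# Stand-in database with exactly 10,000 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", ((i,) for i in range(10_000)))

print(is_large(conn, "customers"))
```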

Define Datasource (Application)

Random sampling is part of configuring a datasource. For details on setup, see Applications.

Enable Random Sampling

To enable random sampling for a database, turn on the jdbc.random.record.fetching toggle.

If your database has more than 10,000 records, specify the approximate number of records in the rows.as.small.dataset field.

Effects of Random Sampling

Random sampling has some visible effects.

Performance Impact

You might notice a delay when random sampling runs. Processing time can increase with the size of the sample.

Variations in Classifications

You should not expect the same classification results for the same database from random sample to random sample.

Each random sample operates on a subset of the data. Depending on variations in the sampling of values in the database, the results of classification can vary.

Each random sample is unique. The records are selected randomly, so results vary from sample to sample.

For example, suppose an EMAIL column does not have consistent values:

  1. Sometimes the value contains a delimiter that separates a first and a last name, followed by an @-sign that indicates the Internet domain.

    A random sampling of such records can result in a consistent classification as PERSON NAME.

  2. Sometimes the value is a bare username, with no delimited last name and no @-sign at all.

    The inconsistent variation in the data makes a concrete classification difficult to derive.

This same inherent inconsistency in the column values can result in variations of classification from run to run, each with its own unique random sampling.
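The run-to-run variation can be reproduced with a small simulation. The sketch below is a simplified model, not PrivaceraCloud's classifier: the regular expression, the 80% match threshold, the "UNKNOWN" label, and the generated column values are all assumptions chosen to mirror the EMAIL example above.

```python
import random
import re

# Hypothetical pattern: delimited first and last name, then @ and a domain.
NAME_AT_DOMAIN = re.compile(r"^[a-z0-9]+[._][a-z0-9]+@[a-z0-9.-]+$")


def classify(sample, threshold=0.8):
    """Label a sample PERSON NAME if enough values match the pattern.

    The threshold and the UNKNOWN label are assumptions for illustration.
    """
    matches = sum(1 for value in sample if NAME_AT_DOMAIN.match(value))
    return "PERSON NAME" if matches / len(sample) >= threshold else "UNKNOWN"


# Made-up inconsistent EMAIL column: 800 delimited addresses, 200 bare usernames.
column = ["first%d.last%d@example.com" % (i, i) for i in range(800)]
column += ["user%d" % i for i in range(200)]

# Each run draws its own random sample of 50 values, so the label can differ.
labels = [classify(random.Random(seed).sample(column, 50)) for seed in range(200)]
print(sorted(set(labels)))
```

Because the column sits right at the assumed threshold, some samples clear it and others do not, so repeated runs disagree. A uniform column would produce the same label on every run.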


Last update: February 26, 2022