Import Data Deduplication

You’ve learned what happens behind the scenes when Measure IQ ingests your data. This document explains what safeguards are in place to ensure that Measure IQ does not ingest duplicate records.

Measure IQ has two layers of duplicate protection: file hashing and event checks.

First, Measure IQ evaluates every file that arrives in your cloud storage solution, noting its upload time, size, name, and other metadata, and computes a hash from that information. Measure IQ then ingests the file into your data tier.

For ingest jobs that may revisit the same directory in your cloud storage solution over the course of days, your files are evaluated and hashed again. When a hash matches that of an existing file, the file is ignored as a duplicate. If anything about a previously downloaded file has changed (upload time, size, arrangement of fields, column names, etc.), a new hash is generated; the file is treated as new and is ingested.
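Measure IQ's exact hashing scheme is not documented here, but the first layer can be sketched as follows. This is a hypothetical illustration that hashes a file's name, size, and upload time and skips any file whose hash has already been logged; the real inputs and algorithm may differ.

```python
import hashlib

# Hypothetical sketch of the first deduplication layer: hash each file's
# metadata and skip files whose hash has already been logged.
# The metadata fields and hash algorithm here are assumptions.

def file_hash(name: str, size: int, upload_time: str) -> str:
    """Derive a hash from the file's metadata."""
    key = f"{name}|{size}|{upload_time}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

seen_hashes: set[str] = set()

def should_ingest(name: str, size: int, upload_time: str) -> bool:
    """Return True if the file is new (or changed) and should be ingested."""
    h = file_hash(name, size, upload_time)
    if h in seen_hashes:
        return False      # identical file already ingested: ignore
    seen_hashes.add(h)    # new or changed file: log the hash
    return True

# Day one: new file -> ingested
print(should_ingest("muppets_6-23-2023.json", 512, "2023-06-23T02:00:00Z"))  # True
# Day two: same file, same hash -> ignored
print(should_ingest("muppets_6-23-2023.json", 512, "2023-06-23T02:00:00Z"))  # False
# Day three: rows added, size changed -> new hash -> ingested
print(should_ingest("muppets_6-23-2023.json", 640, "2023-06-23T02:00:00Z"))  # True
```

Because the hash is derived from the metadata, any change to size or upload time produces a different hash, which is why a modified file is re-ingested rather than skipped.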

For the second layer of deduplication, the data tier checks incoming events against stored events and discards events that are already stored. For a duplicate event to be dropped in this stage, the event must be exactly the same. All fields must be named the same, all properties must contain the exact same information, and all fields/properties must be in the same order.
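The exact-match requirement can be sketched like this. In this hypothetical illustration, each event is serialized with its field order preserved and hashed into a key; an incoming event is discarded only when its key matches a stored one, so a reordered or renamed field makes the event count as new.

```python
import hashlib
import json

# Hypothetical sketch of the second deduplication layer: an event is a
# duplicate only if it matches a stored event exactly -- same field names,
# same values, same field order. json.dumps preserves dict insertion order,
# so reordered fields produce a different key.

def event_key(event: dict) -> str:
    return hashlib.sha256(
        json.dumps(event, sort_keys=False).encode("utf-8")
    ).hexdigest()

stored: set[str] = set()

def ingest_event(event: dict) -> bool:
    """Store the event unless an exact duplicate is already present."""
    key = event_key(event)
    if key in stored:
        return False   # exact duplicate: discard
    stored.add(key)
    return True

kermit = {"ID": 1, "Muppet Name": "Kermit", "action": "hosts"}
print(ingest_event(kermit))                                                 # True
print(ingest_event({"ID": 1, "Muppet Name": "Kermit", "action": "hosts"}))  # False
# Same data, different field order -> treated as a new event
print(ingest_event({"Muppet Name": "Kermit", "ID": 1, "action": "hosts"}))  # True
```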

File Hashing:

Measure IQ has a daily ingest job with a 3-day lookback, meaning it scans its target directory structure for folders and files once a day for 3 days.

On Day one, the job encounters a file called muppets_6-23-2023.json that contains the following:

| Timestamp | ID | Muppet Name | action | Language | location |
|---|---|---|---|---|---|
| 6/23/2023 1:01:01 AM | 1 | Kermit | hosts | English | main stage |
| 6/23/2023 1:10:01 AM | 2 | Animal | plays drums | Howling | main stage |
| 6/23/2023 1:15:01 AM | 3 | Fozzy Bear | tells a joke | English | main stage |
| 6/23/2023 1:20:01 AM | 4 | Rolf | plays piano | English | piano bar |
| 6/23/2023 1:05:01 AM | 5 | Chef | throws fish | Swedish | kitchen |

A unique hash is created and logged, and the file is ingested.

On Day two, ingest creates a hash for muppets_6-23-2023.json and compares it to the existing hash. The file is the same, so the hashes are the same, and muppets_6-23-2023.json is ignored.

On Day three, ingest evaluates muppets_6-23-2023.json. This time the file contains:

| Timestamp | ID | Muppet Name | action | Language | location |
|---|---|---|---|---|---|
| 6/23/2023 1:01:01 AM | 1 | Kermit | hosts | English | main stage |
| 6/23/2023 1:10:01 AM | 2 | Animal | plays drums | Howling | main stage |
| 6/23/2023 1:15:01 AM | 3 | Fozzy Bear | tells a joke | English | main stage |
| 6/23/2023 1:20:01 AM | 4 | Rolf | plays piano | English | piano bar |
| 6/23/2023 1:05:01 AM | 5 | Chef | throws fish | Swedish | kitchen |
| 6/23/2023 1:25:01 AM | 6 | Gonzo | soars through the air | English | trapeze |
| 6/23/2023 1:30:01 AM | 4 | Rolf | takes a nap | English | back stage |

Because rows have been added, the file size is different, so the hash is different and the file is ingested.

Event Checking:

In this example, on the third day's pass, only the new records are ingested: event checking evaluates each row for duplication, ignores the first 5 records, and ingests the last 2.

| Timestamp | ID | Muppet Name | action | Language | location |
|---|---|---|---|---|---|
| 6/23/2023 1:25:01 AM | 6 | Gonzo | soars through the air | English | trapeze |
| 6/23/2023 1:30:01 AM | 4 | Rolf | takes a nap | English | back stage |
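The day-three pass above can be walked through as a hypothetical sketch: the five rows stored on Day one are exact duplicates and are dropped, while the two new rows survive the filter. Representing each event as a tuple stands in for the exact-match comparison described earlier.

```python
# Hypothetical walk-through of the day-three pass. Each row is a tuple of
# (Timestamp, ID, Muppet Name, action, Language, location), mirroring the
# sample file above; exact tuple equality stands in for event checking.

stored = {
    ("6/23/2023 1:01:01 AM", 1, "Kermit", "hosts", "English", "main stage"),
    ("6/23/2023 1:10:01 AM", 2, "Animal", "plays drums", "Howling", "main stage"),
    ("6/23/2023 1:15:01 AM", 3, "Fozzy Bear", "tells a joke", "English", "main stage"),
    ("6/23/2023 1:20:01 AM", 4, "Rolf", "plays piano", "English", "piano bar"),
    ("6/23/2023 1:05:01 AM", 5, "Chef", "throws fish", "Swedish", "kitchen"),
}

day_three = [
    ("6/23/2023 1:01:01 AM", 1, "Kermit", "hosts", "English", "main stage"),
    ("6/23/2023 1:10:01 AM", 2, "Animal", "plays drums", "Howling", "main stage"),
    ("6/23/2023 1:15:01 AM", 3, "Fozzy Bear", "tells a joke", "English", "main stage"),
    ("6/23/2023 1:20:01 AM", 4, "Rolf", "plays piano", "English", "piano bar"),
    ("6/23/2023 1:05:01 AM", 5, "Chef", "throws fish", "Swedish", "kitchen"),
    ("6/23/2023 1:25:01 AM", 6, "Gonzo", "soars through the air", "English", "trapeze"),
    ("6/23/2023 1:30:01 AM", 4, "Rolf", "takes a nap", "English", "back stage"),
]

# Keep only rows that are not exact duplicates of stored events.
new_events = [row for row in day_three if row not in stored]
print(len(new_events))  # 2
```

Note that Rolf's second row shares an ID with his first, but because the other fields differ it is not an exact duplicate and is ingested.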