How to find duplicates in JSON?
Duplications Checker is an Apify actor that helps you find duplicates in your datasets or JSON arrays.
This actor expects a JSON object as input. You can also set it up in the visual UI editor on Apify. You can find examples in the Input and Example Run tabs of the actor page in Apify Store. All input fields, regardless of section, are top-level fields.
Main input fields
Show options
Dataset pagination options
Other data sources
preCheckFunction lets you transform the input data before the actual check. Its main use is to ensure that the field you are checking is a top-level (not nested) field and that its value is a simple value such as a number or a string. (Deep equality checks for nested structures are not supported, for simplicity and performance reasons.)
So for example, let's say you have an item with a nested field images:
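A hypothetical item of this shape is sketched below. The URLs and the `src` property inside each `images` entry are assumptions made for illustration, not fields taken from a real dataset.

```javascript
// Hypothetical item: the URLs and the `src` property name are assumptions.
const item = {
  url: 'https://example.com/product/1',
  images: [
    { src: 'https://example.com/img/1-front.jpg' },
    { src: 'https://example.com/img/1-back.jpg' },
  ],
};

// The value worth checking sits two levels deep, so it cannot serve
// as a top-level `field` without a transformation first.
const firstImageUrl = item.images[0].src;
```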
If you want to check the first image URL for duplicates and keep the item url for reference, you can transform the whole data with a simple preCheckFunction:
Now set field in the input to imageUrl and the check will run on the extracted value.
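A minimal sketch of such a transformation follows. It assumes that preCheckFunction receives the whole array of items and returns the transformed array, and that each image entry stores its URL under a `src` property. Both are assumptions for illustration, so check the actor's Input tab for the exact signature.

```javascript
// Assumption: the function receives the full array of items and
// returns a new array containing only flat, top-level fields.
const preCheckFunction = (items) => items.map((item) => ({
  url: item.url,
  // Optional chaining yields undefined when an item has no images.
  imageUrl: item.images?.[0]?.src,
}));

// Hypothetical sample data used only to show the effect.
const sampleItems = [
  { url: 'https://example.com/product/1', images: [{ src: 'https://example.com/img/1.jpg' }] },
  { url: 'https://example.com/product/2', images: [] },
];

const flattened = preCheckFunction(sampleItems);
```

Each transformed item keeps url for reference and exposes the first image URL as a top-level imageUrl field.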
At the end of the actor run, the report is saved to the default key-value store as OUTPUT. In addition, if showItems is true, the duplicate items are pushed to the dataset.
By default, the report includes all information, but you can leave parts out by setting any of showIndexes, showItems, or showMissing to false.
The report is an object in which every field value that appeared at least twice (that is, every duplicate) is included as a key. For each key, the report contains count (at least 2), originalIndexes (the indexes of the items in your original dataset, or in the data returned by preCheckFunction) and outputIndexes (only present when showItems is enabled). The indexes help you navigate the duplicates in your data.
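To make the report shape concrete, here is a rough sketch of how such a report could be computed for a single field. It is not the actor's implementation. The helper name `buildReport` and the sample data are hypothetical, and outputIndexes is left out because it depends on what gets pushed to the dataset.

```javascript
// Hypothetical helper that mimics the described report shape.
function buildReport(items, field) {
  const indexesByValue = new Map();
  items.forEach((item, index) => {
    const value = item[field];
    if (value === undefined) return; // items missing the field are skipped in this sketch
    const key = String(value);
    if (!indexesByValue.has(key)) indexesByValue.set(key, []);
    indexesByValue.get(key).push(index);
  });

  const report = {};
  for (const [key, indexes] of indexesByValue) {
    // Only values seen at least twice count as duplicates.
    if (indexes.length >= 2) {
      report[key] = { count: indexes.length, originalIndexes: indexes };
    }
  }
  return report;
}

// Hypothetical sample data.
const checkedItems = [
  { url: 'https://example.com/a', imageUrl: 'https://example.com/img/1.jpg' },
  { url: 'https://example.com/b', imageUrl: 'https://example.com/img/2.jpg' },
  { url: 'https://example.com/c', imageUrl: 'https://example.com/img/1.jpg' },
];

const sketchReport = buildReport(checkedItems, 'imageUrl');
```

In this sample, only the image URL shared by the first and third items ends up as a key, with a count of 2 and originalIndexes of 0 and 2.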
The items are intentionally not included in the OUTPUT report, to keep it small. Instead, they are pushed to the default dataset, where you can locate them with outputIndexes. If you need to connect the OUTPUT with the dataset for deeper analysis, use these indexes to find the matching items.
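A small sketch of that lookup is shown below. It assumes you have already loaded the OUTPUT report and the default dataset items into memory. The helper name `itemsForValue` and all sample values are hypothetical.

```javascript
// Hypothetical data shaped like the OUTPUT report and the default dataset.
const outputReport = {
  'https://example.com/img/1.jpg': {
    count: 2,
    originalIndexes: [0, 2],
    outputIndexes: [0, 1],
  },
};
const datasetItems = [
  { url: 'https://example.com/a', imageUrl: 'https://example.com/img/1.jpg' },
  { url: 'https://example.com/c', imageUrl: 'https://example.com/img/1.jpg' },
];

// Return the dataset items that belong to one duplicated value.
function itemsForValue(report, items, value) {
  const entry = report[value];
  // outputIndexes is only present when showItems was enabled.
  if (!entry || !Array.isArray(entry.outputIndexes)) return [];
  return entry.outputIndexes.map((index) => items[index]);
}

const duplicateItems = itemsForValue(outputReport, datasetItems, 'https://example.com/img/1.jpg');
```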
The first version of the actor could check multiple fields at once, but the output was very complicated and the implementation too convoluted, so I abandoned the idea for the sake of simplicity. If you want to check more fields, simply run the actor once for each field. Since the actor's consumption is low, this is not a big deal.
- Use a JSON parser that detects duplicates.
- Use a JSON validator that detects duplicates.
- Compare the original request with JSON.stringify(JSON.parse(request.content)). The original request may be a "pretty" version, though, so a simple string comparison won't work.
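The last option can be made robust to pretty-printing by comparing key counts instead of strings. The sketch below is one possible variation, not a standard API: it relies on JSON.parse keeping only the last value for a repeated key, so the raw text contains more key tokens than the parsed result whenever a key is duplicated. The function names are hypothetical.

```javascript
// Count key tokens in raw JSON text. In valid JSON, a string token
// followed by a colon is always an object key.
function countKeysInText(text) {
  let count = 0;
  let i = 0;
  while (i < text.length) {
    if (text[i] !== '"') {
      i += 1;
      continue;
    }
    i += 1; // step past the opening quote
    while (i < text.length && text[i] !== '"') {
      i += text[i] === '\\' ? 2 : 1; // skip escaped characters
    }
    i += 1; // step past the closing quote
    let j = i;
    while (j < text.length && /\s/.test(text[j])) j += 1;
    if (text[j] === ':') count += 1;
  }
  return count;
}

// Count keys in the parsed value, recursing into arrays and objects.
function countKeysInValue(value) {
  if (Array.isArray(value)) {
    return value.reduce((sum, element) => sum + countKeysInValue(element), 0);
  }
  if (value !== null && typeof value === 'object') {
    return Object.keys(value)
      .reduce((sum, key) => sum + 1 + countKeysInValue(value[key]), 0);
  }
  return 0;
}

// True when at least one object in the text repeats a key.
function hasDuplicateKeys(text) {
  const parsed = JSON.parse(text); // throws on invalid JSON
  return countKeysInText(text) > countKeysInValue(parsed);
}

console.log(hasDuplicateKeys('{"a": 1, "a": 2}')); // true
console.log(hasDuplicateKeys('{\n  "a": 1,\n  "b": {"a": 2}\n}')); // false
```

This only reports whether a duplicate exists. To learn which key is repeated and where, a parser or validator that reports duplicates, as in the first two options, is the better fit.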