Removing duplicates from lists

11 minutes 2266 words #programming #python


In programming, we often need to remove duplicates from an array-like structure. While this may seem like a simple task with an obvious solution, that’s not always the case. It’s easy to solve for the obvious case of something like a list of numbers, but it gets much harder once we need to deal with non-hashable objects while keeping the algorithm’s runtime relatively low.

For this post, I’ll be using Python for the explanations and examples, but everything here applies to pretty much every programming language, just with a different syntax.

# Removing hashable duplicates (Simple case)

Let’s say we have a list like [1, 2, 4, 4, 5, 1, 3]. The easiest and most optimized way to remove the duplicates from this list is to simply convert it into a set. In Python, a set is a data-structure that acts a bit like a mapping/dictionary. Sets rely on hashes, and since equal objects should hash to the same values, sets by design prevent the possibility of holding multiple equal elements.

An example of this can be as simple as:

duplicate_list = [1, 2, 4, 4, 5, 1, 3]
result = list(set(duplicate_list))

This will result in a clean list without any duplicate values. Converting the list into a set loops over it once (done internally), which takes O(n), and there’s another O(n) for the conversion back to a list, but O(2n) is still just O(n). This makes hashing a very compelling solution.
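
One caveat worth knowing: converting to a set doesn’t preserve the original order of the elements. If the order matters, the same O(n) deduplication can be done with dict.fromkeys, since Python dictionaries are guaranteed to preserve insertion order (since Python 3.7):

duplicate_list = [1, 2, 4, 4, 5, 1, 3]
# Each element becomes a dict key (duplicates collapse), and order is kept
result = list(dict.fromkeys(duplicate_list))  # [1, 2, 4, 5, 3]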

However, this solution does have its issues, specifically the fact that it can’t be used for everything. If we had a list such as [[1, 2, 3], [1, 2, 3]], we would have a pretty big problem: the inner lists aren’t hashable, so we can’t use sets here. The same would happen with inner sets or any custom unhashable types.

# Removing unhashable duplicates

To cope with this problem, we have 5 general options:

  1. Do == (__eq__) comparisons for each value against every other value
  2. Compare exact object
  3. Convert the elements into their hashable counterpart
  4. Generate our own universal hashes for our unhashables
  5. Generate specific hashes depending on the objects

Note about hashes: in the last 3 options, what we’re essentially doing is making an unhashable object hashable somehow. While this is probably fine for our use-case, there usually is a reason why an object isn’t hashable (unless the object’s author was just careless). That reason is usually that the object is mutable, i.e. it can be changed without making a new object. A mutated object would end up with a new hash, since the hash is typically computed from values that have since changed.
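
To illustrate why hashing mutable objects is dangerous, here’s a small contrived Point class (my own example, not from any library) that computes its hash from its attributes; mutating an instance after putting it into a set silently breaks set lookups:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

p = Point(1, 2)
points = {p}
p.x = 99  # mutate after insertion; the set still stores p under the old hash
print(p in points)  # False, even though p is literally inside the set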

# Equality comparisons with __eq__ (the simpler option)

The easiest way is to simply rely on __eq__ (i.e. the special method called by Python when performing == comparisons). Doing this means we don’t need hashable objects (objects defining __hash__), however we now need the objects to define this equality check. This isn’t a huge issue though, since it is much more common for objects to define equality checks than to be hashable. At least for built-in objects, all of the unhashable ones define it.

We can tackle this by making another result list that will hold each value only once. This is the list against which we run the == comparisons (for each value in it) to avoid ever inserting a duplicate value into it. Example of this code:

duplicate_list = [1, 1, 2, 4, [1, 2], [2, 1], [1, 2], "foo", "bar"]
result = []
for element in duplicate_list:
    # Go through the whole `result` list and only add current
    # `element` if it isn't already there.
    if element not in result:
        result.append(element)

Using element in result internally calls the __contains__ method of our result list, which is implemented to go through every element of that list and perform an __eq__ equality check; if a check succeeds, it returns early with true. In the worst case though, it performs this comparison against every element of the list, and only once all elements are exhausted can it return false (the element wasn’t in the result list).

This is a very common and simple solution, however it isn’t a very efficient one. In the worst case (every element is unique), it goes through the whole duplicate_list (that’s O(n)), and for each element it also goes through the result list, which in the worst case grows with every iteration until it reaches the size of n (since every unique element gets added). That’s another O(n) inside our original O(n), making the algorithm O(n^2).

To resolve the efficiency problem, we could add an optimization where all hashable values go through a set, resorting to __eq__ checks only for the non-hashables, as sketched below. This still has a worst case of O(n^2) (all elements are non-hashable), but a best case of O(n) (all elements are hashable). This approach is probably the best trade-off between a simple implementation and relatively good efficiency.
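
A minimal sketch of this hybrid could look like the following, using a try/except around the set operations to detect unhashable elements:

duplicate_list = [1, 1, 2, [1, 2], [2, 1], [1, 2], "foo", "foo"]

seen_hashable = set()
result = []
for element in duplicate_list:
    try:
        # Fast path: hashable elements get O(1) set membership checks
        if element not in seen_hashable:
            seen_hashable.add(element)
            result.append(element)
    except TypeError:
        # Slow path: unhashable elements fall back to O(n) __eq__ scans
        if element not in result:
            result.append(element)

print(result)  # [1, 2, [1, 2], [2, 1], 'foo']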

# Compare exact objects

In most programming languages, there is a way to access the memory id of an object; in Python, this is done with the id() function. This memory id is unique, since two objects that are alive at the same time can’t share the same memory id (though CPython can reuse the id of an object after it has been garbage-collected). Essentially this is the difference between x is y and x == y: is basically does id(x) == id(y), while == uses the x.__eq__(y) method defined for x’s class.

We can use this id for our comparisons instead of __eq__, and since it is unique, we can even use it as a hash of that object. However, be aware that this results in some weirdness, and this method likely isn’t what you’d want for general duplicate removal.

a = "hello"
b = a  # Python will not make a new string object here, so it will use the same one

print(a == b) # <- True
print(a is b) # <- True

c = "hi"
d = "hi" # This is a new string object, and the ids will be different here

print(c == d) # <- True
print(c is d) # <- False

Note: In many Python implementations, c is d will actually also be True because of internal memory optimizations (in CPython’s case, string interning); it is simply more efficient to reuse an existing object. This shouldn’t be relied on though, and with some interpreters it simply won’t happen, giving us the shown False.

An algorithm like this is relatively simple:

class Foo:
    # A minimal Foo class so that the example below is runnable
    def __init__(self, x):
        self.x = x

x = Foo(x=5)
y = Foo(x=5)
duplicate_list = [x, x, y, 1, 2, "hi", Foo(x=5)]

result = {}
for element in duplicate_list:
    # Memory ids are unique and hashable, so they work as dict keys
    result[id(element)] = element

deduped = list(result.values())

To preserve the original elements, we use a dict that holds the unique hashable memory ids of our objects as keys and the actual (possibly unhashable) objects as values. Once we’re done, we just collect all of its values into a list.

The result of this would be [x, y, 1, 2, "hi", Foo(x=5)]. (Note that x, y and Foo(x=5) would actually print the same way, since they’re instances of the same class sharing the same __repr__.) From this output we can clearly see that even though x, y and Foo(x=5) hold exactly the same attributes, they’re different objects with different memory ids, so our algorithm didn’t remove them. However, there is now only one x, because the second one was indeed the exact same object, so it did get removed.

# Convert unhashables into their hashable counterpart (case-by-case)

We could try and find some alternative hashable data-structure for our unhashable ones. With builtin data-structures, this is completely doable:

  • unhashable list can be fully represented by a hashable immutable tuple: tuples are ordered, so the hash of (1, 2) differs from the hash of (2, 1), just as it should with lists. Since tuples aren’t mutable, they can be hashable.
  • unhashable set can be fully represented by a hashable immutable frozenset: this is a Python built-in which, even though it’s not very well known, is simply an immutable set, making it possible to be hashable too.

However, we can’t simply replace these objects with their counterparts directly, since that would cause collisions. For example, in a list like [[1, 2], (1, 2)], the [1, 2] would get converted into the tuple (1, 2), making it identical to the real tuple even though it shouldn’t be. To fix this, we can represent each object as the tuple (type(obj), hashable_counterpart(obj)); for a list, this would look like (list, tuple(obj)), with the first element being the actual type. Real tuples would simply become (tuple, obj). This also makes it possible to convert everything back once we’re finished.

Even though this is a viable solution for built-in data types, we need to be sure that we really will only encounter these built-ins, and that our list will never contain anything else, or if it does, that it’s handled appropriately. Full code for this would be entirely dependent on which data types can appear in the list, but the description above should sufficiently describe how it would look.
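
Still, here’s a minimal sketch of the idea, assuming the list only ever contains lists, tuples, sets and otherwise-hashable values (the to_hashable helper is just one way to structure this):

def to_hashable(obj):
    # Tag each converted value with its real type, so that e.g. [1, 2]
    # and (1, 2) don't collide after the conversion
    if isinstance(obj, list):
        return (list, tuple(to_hashable(item) for item in obj))
    if isinstance(obj, set):
        return (set, frozenset(to_hashable(item) for item in obj))
    if isinstance(obj, tuple):
        return (tuple, tuple(to_hashable(item) for item in obj))
    return (type(obj), obj)

duplicate_list = [[1, 2], (1, 2), [1, 2], {1, 2}, 5, 5]

result = {}
for element in duplicate_list:
    result[to_hashable(element)] = element

print(list(result.values()))  # [[1, 2], (1, 2), {1, 2}, 5]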

This results in O(n), however it should be kept in mind that the conversion into the hashable counterparts can take significant time; we are creating new objects for each unhashable one, which could end up taking longer than the O(n^2) method using simple __eq__ comparisons. This is entirely dependent on how the individual cases are handled.

# Generate universal custom hashes for unhashables

If we don’t have a strictly defined set of data structures that can be present in our list, we need to handle every possibility. We can devise our own way to generate a unique hash for any object whatsoever. This may sound hard, but it’s not actually that complicated.

We can go about this in many ways, but as a simple proof that it’s possible: we know that all objects must be stored somehow, and that happens in memory (RAM), a device that can only hold bits. This means our object lives somewhere in memory as an array of bytes, and any array of bytes can be interpreted as a plain integer. Since integers are hashable, we’re done: we can compute a hash of any object.

Python has a handy library called pickle that is used for data serialization (converting whole Python objects into a stream of bytes that can be written to a file for storage, or sent elsewhere). Since these streams of bytes are hashable, we can simply pickle all of the objects and use the hashes of the results.
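
As a minimal sketch (with the caveat that two equal objects aren’t strictly guaranteed to pickle to byte-identical output in every situation, so treat this as an approximation rather than a bulletproof solution):

import pickle

duplicate_list = [[1, 2], [1, 2], {"a": 1}, {"a": 1}, "hi"]

result = {}
for element in duplicate_list:
    # pickle.dumps returns a bytes object, which is hashable
    result[pickle.dumps(element)] = element

print(list(result.values()))  # [[1, 2], {'a': 1}, 'hi']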

Again though, just like in the previous case, we should keep in mind that this process can take a while, especially when pickling objects with a large number of internal attributes, which may themselves hold many other attributes. This could end up being a very intensive process, depending on the individual objects. So even with the O(n) benefit of hashing, this could take longer than the O(n^2) equality approach.

# Generate hashes specific to our objects (case-by-case)

Another approach we could take is similar to the one above, but rather than using pickle and generating a hash from every single aspect of our object, we can take a more targeted approach and use only the relevant aspects of these objects to make the hash. This is perhaps a bit more similar to the case-by-case representation with a hashable counterpart.

For example, for an object that holds many attributes, of which only a few actually determine its uniqueness, we can construct the hash purely from those: hash((obj.attribute1, obj.attribute2)). Ideally we would also include the object’s class: hash((obj.__class__, obj.attribute1, obj.attribute2)).

This ends up being a lot faster than doing a full serialization of our object with pickle, however since this is a case-by-case approach, it again only works when we have a pre-determined set of classes that can appear in our list. It isn’t applicable to everything, since we don’t know the uniqueness-determining attributes of every custom class; that’s why such custom classes should define their own __hash__. And again, since this relies on hashing, it is O(n).
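
As a minimal sketch, with a hypothetical Point class where only x and y determine uniqueness:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.cached_length = None  # irrelevant to uniqueness, so not hashed

duplicate_list = [Point(1, 2), Point(1, 2), Point(3, 4)]

result = {}
for element in duplicate_list:
    # Build the key from the class and the identity-determining attributes
    key = (element.__class__, element.x, element.y)
    result[key] = element

print(len(result))  # 2 unique points remain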

# Recap

In most cases, the first approach will work just fine (the simple case where our list only holds hashable objects).

If however we do need to deal with unhashables, we can use == comparisons, or even direct memory ids to ensure that the objects are exactly the same; or we can use some hashable counterpart that can represent our unhashable object. We can also construct a universal hash function by serializing the object, or a specific one if we know which classes will appear.

Even though we do have ways to deal with unhashables, if you’re in control of the classes and they aren’t supposed to be mutable, always make sure to add a __hash__ method to them, so that duplicates can be removed in O(n) without any of these complicated inconveniences.