Analyzing posts and finding duplicates
Posted on
As the number of posts in db grew, so has one particular issue: the presence of duplicates. People would post an advert, and some time later, post another one, usually quite similar. This has made the search function more and more cumbersome to use.
In this post, I will present the research I led to find duplicates. But before that, I will also give a quick presentation of the adverts posted in /r/mechmarket.
Adverts
Here’s the histogram, and below the pie chart, of the adverts posted in the subreddit, with a breakdown by advert type. The data goes from May 2021, which means about 9 months of data.
ad_type | Count |
---|---|
Selling | 26478 |
Sold | 6238 |
Buying | 15101 |
Purchased | 1187 |
About 1 third of the adverts are Buying / Purchased, which is somewhat higher than I expected. We can also notice that the proportion Purchased / Bought (about 1 / 15) is much lower than Sold / Selling (about 1/5).
We can also notice that many adverts are never set to “Sold” (or “Purchased”): either they are not sold, or the status are not updated by the authors (more on this at the end of this post).
Duplicates
Back on the topic. Let’s start by trying determine an upper bound on the number of duplicate adverts in the database. From now on, we will look only at the “Selling” posts.
Same author adverts
One very simple heuristic to determine duplicate adverts would be to look at the number of times a user posted an advert.
We can start by plotting the histogram of number of author by number of posts. Each bar in this histogram indicates the number of authors which have posted a given number of times.
For example,
- A little more than 4000 people have posted a single “Selling” advert.
- A little more than 1500 people have posted a two “Selling” advert.
- Some people have posted up to 70 adverts in 9 month: almost 2 adverts a week !
Note that this histogram does not give a very clear idea of the amount of adverts that might be duplicates. Let’s compute for each advert in the db whether it is the most recent post by it’s author. Let’s have a look at the graph below:
As a rough estimate, almost 4/5 of adverts are not the last one an author has made !
Better heuristics for duplicates
As we saw, the potential number of duplicates is huge. But it’s far from impossible that someone posted two different adverts for two different products; in this case, flagging the old one as duplicate would not be the right thing to do. Thus we need to find better heuristics.
For example, let’s look at the adverts posted by some user:
# | date | Offers |
---|---|---|
1 | 2021-08-17 | ai03 Vega, GMK Peach Blossom |
2 | 2021-09-19 | Switches |
3 | 2021-10-01 | Switches, Deskmats |
4 | 2021-10-11 | Switches |
It seems safe to assume that adverts #2 and #3 are duplicates of #4, while #1 is not. In order to identify such cases, the “Offers” field is broken into pieces by various separators (,
, |
, /
). Whenever one group is in common between two adverts, the oldest one is marked as duplicate.
In order to take into accounts some edgecases where the user has modified the description of the same item, we also use the https://github.com/seatgeek/thefuzz library; adverts where the “Offers” field has a high similarity will also be marked as duplicate.
One last heuristic we can use is to take into account the number of identical words (space separated) in two adverts. Although this might seem a good idea at first, it is a dangerous one, since it can easily lead to different adverts being flagged as similar. In practice, to be safe, the number of common words has to be quite high, which makes this measure less useful.
If you are interested in the exact implementation, you can find the related file in the fleebmarket repo (it's quite short, I promise !).
Results
With these heuristics at hand, I could mark about 2/5th of the adverts as duplicates. Not only does it help to provide better search results, it also reduces the database load !
is_duplicate | is_last_post | Count |
---|---|---|
No | No | 8787 |
No | Yes | 7297 |
Yes | No | 10407 |
Yes | Yes | 0 |
We can also look again at the number of authors by number of posts. In the graph below (log scale !), we compare the case with all posts are taken into account, vs the case where only non-duplicates are taken into account.
Remember this is log scale, so a small decrease is in fact a big difference.
Next steps
As stated at beginning of this article, many adverts are not marked as Sold. Often, only a subset of the items listed have indeed been sold, so this is legitimate, but still annoying.
In order to exclude those items from the search results, one way is to extract each item from the advert body, and try to guess if it has already been sold. The main algorithms to do this have been written; they are not perfect, but will hopefully provide an even better searching experience !