
Feat: Fuzzy match

Vincent Viers requested to merge fuzzy-match into main

This uses the Levenshtein distance (aka edit distance) to fuzzy-match misspelled départements in source data (e.g. "Haut-de-Seine" instead of "Hauts-de-Seine", or "Réunion" instead of "La Réunion").
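For reviewers unfamiliar with the technique, here is a minimal sketch of what this kind of matching looks like. It is not the code in this MR: the names `levenshtein` and `fuzzy_match` and the candidate list are made up for illustration, and only the `max_distance` of 3 comes from the description.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def fuzzy_match(value, candidates, max_distance=3):
    """Return the closest candidate within max_distance edits, else None.

    Hypothetical helper: comparison is case-insensitive, ties go to the
    first candidate in the list.
    """
    best, best_distance = None, max_distance + 1
    for candidate in candidates:
        distance = levenshtein(value.lower(), candidate.lower())
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


# Illustrative candidate list, not the project's reference data.
departements = ["Hauts-de-Seine", "La Réunion"]
print(fuzzy_match("Haut-de-Seine", departements))
print(fuzzy_match("Réunion", departements))
```

Both misspellings from the description resolve to their intended département here, since they are 1 and 3 edits away respectively.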

This is a very basic implementation that does the job as far as my use case is concerned. There's room for improvement, including:

  • use the distance relative to the length of the longer of the two strings being compared, to avoid false positives such as "paca" and "aura", which have an edit distance of 3 despite being totally different regions. I am not too concerned about this, as we typically handle abbreviations as explicit cases in insitu/importer/validators.py.
  • allow specifying a per-dataset max_distance value in the config file (3 is a pretty arbitrary choice, although it does the job as far as my use cases are concerned)
  • use ngrams as in addok https://github.com/addok/addok/blob/master/addok/helpers/text.py#L164-L177
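To make the first point concrete, here is a rough sketch of a length-relative threshold. Everything in it is an assumption for discussion: `relative_distance`, `is_fuzzy_match` and the `max_ratio` value of 0.35 are hypothetical and not part of this MR.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution
            ))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string (0.0 to 1.0)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_fuzzy_match(value: str, candidate: str, max_ratio: float = 0.35) -> bool:
    """Hypothetical check: accept only if the relative distance is small enough.

    max_ratio is an arbitrary illustrative value; it could be read per
    dataset from the config file instead of being hard-coded.
    """
    return relative_distance(value.lower(), candidate.lower()) <= max_ratio


print(is_fuzzy_match("paca", "aura"))                     # 3 edits over 4 characters
print(is_fuzzy_match("Haut-de-Seine", "Hauts-de-Seine"))  # 1 edit over 14 characters
```

With a ratio like this, "paca" vs "aura" (3/4) is rejected while "Réunion" vs "La Réunion" (3/10) still passes, even though both pairs have the same absolute distance of 3.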

I'm thinking we figure it out as we go, unless one of you really has an issue with the current implementation?
