normalising names

Information to help in normalising names

This is an attempt to aid in normalising names, by removing some special characters which are very hard to see or are not actually visible. This is not going to be complete list as there are countless languages. However, this information can help in narrowing names down to avoid duplications.

Filter List – These characters should be removed or ignored when storing or checking for duplicates.

Substitution List – These characters should be substituted when storing or checking for duplicates.

Substituton (optional) – In some cases removal of diacritical marks might be necessary. There are some rules to be followed. Javascript: str.normalize("NFD").replace(/\p{Diacritic}/gu, "") seem to do this internally. Other programming languages might need to look up a pre-defined list.

Filter List (v. 20211007-01) [unicode]
/=================================================\
|code| UTF-8  | HTML  |          info             |
|----|--------|-------|---------------------------|
|    |        |       |*ALL FROM U+0000 TO U+001F |
|    |        |       |                           |
|007F|      7F|       |DELETE                     |
|00AD|   C2 AD|­  |SOFT HYPHEN                |
|200B|E2 80 8B|       |ZERO WIDTH SPACE           |
|200C|E2 80 8C|‌ |ZERO WIDTH NON-JOINER      |
|200D|E2 80 8D|‍  |ZERO WIDTH JOINER          |
|3000|E3 80 80|       |IDEOGRAPHIC SPACE          |
|FEFF|EF BB BF|       |ZERO WIDTH NO-BREAK SPACE  |
\=================================================/




Substitution List (v. 20211007-01) [unicode]
/===================================================================================================\
|                    FROM                         |                     TO                          |
|----|--------|-------|---------------------------|----|--------|-------|---------------------------|
|code| UTF-8  | HTML  |           info            |code| UTF-8  | HTML  |           info            |
|----|--------|-------|---------------------------|----|--------|-------|---------------------------|
|    |        |       |                           |    |        |       |                           |
|00A0|   C2 A0|  |NO-BREAK SPACE             |0020|      20|       |SPACE                      |
|2013|E2 80 93|–|EN DASH                    |002D|      2D|       |HYPHEN-MINUS               |
|    |        |       |                           |    |        |       |                           |
\===================================================================================================/

keyword: normalising, normalizing, normalisation, normalization, normalise, normalize

Supplementary reading: https://betterexplained.com/articles/unicode/ http://www.unicode.org/Public/UNIDATA/PropList.txt

(c) Ram Narula You can use this information, do give credit: github rambkk – Ram Narula – pluslab.net

github rambkk normalising-names

Author: Ram Narula – github rambkk – <pluslab.net>