NAME
    Lingua::NameUtils - Identify given/family names and capitalize correctly

SYNOPSIS
     use 5.014; # or later

     use Lingua::NameUtils qw(
         namecase gnamecase fnamecase namecase_exception
         namesplit nameparts namesplit_exception
         nametrim normalize
     );

     use Lingua::NameUtils ':all';   # All of the above functions
     use Lingua::NameUtils ':case';  # The case functions and normalize
     use Lingua::NameUtils ':split'; # The split functions and normalize

     # Case functions

     $full_name = namecase($full_name);
     $given_names = gnamecase($given_names); # i.e. Given name(s) only
     $family_name = fnamecase($family_name); # i.e. Family name only
     $family_name = fnamecase($family_name, $given_names); # Individual exceptions

     namecase_exception("Fitzell"); # Add an exception for all members of a family
     namecase_exception(qw(DeVries DiFrancesco)); # Add more exceptions
     namecase_exception("Marrier D'Unienville, Jean"); # Add an individual exception

     # Split functions

     $full_name = namesplit($full_name); # Format as "Family_name, Given_names"

     ($family_name, $given_names) = nameparts($full_name); # Format as an array

     namesplit_exception("Bryant Smith, Denise"); # Multi-name family names

     # Trim/squeeze function

     $name = nametrim($name);

     # Unicode normalization of internal data

     use Unicode::Normalize qw(NFD);
     normalize(\&NFD);

DESCRIPTION
    This module is useful when you receive people's names that might be all
    uppercase or otherwise wrongly cased, or that have the given names and
    the family name combined in a single string (e.g., a single spreadsheet
    column). It splits a full name into its parts and sets the
    capitalization correctly, so as to show each person a little respect by
    taking the trouble to at least try to get their name right.

    Getting the case right for people's names is difficult, and many
    software systems address this problem by not even trying, and using
    uppercase exclusively. It's ugly, but it's easy and consistent. We can
    do better. It can't be perfect, by default, but with ongoing adjustments
    to suit your evolving dataset, you can improve it to meet your needs.

    People with complex grammatical aristocratic/topographic/patronymic
    family names often don't know how their own names should be capitalized.
    Or at least, they don't know how their own ancestors capitalized their
    name, or they know, but they disagree with it. Some people insist on
    having it their own way, and that's fine. This module, by default,
    prefers how their ancestors would have capitalized their names, but
    people can do whatever they want to their own names, and it's important
    to them, so this module supports general exceptions that apply to
    everyone with a particular family name, for when the default behaviour
    is definitely wrong, and it also supports exceptions that apply only to
    individuals who report that it is wrong for them.

    Note: Apart from Chinese, Japanese, and Korean family names, this module
    only understands names in Latin scripts (except perhaps by lucky
    accident: e.g., names in Cyrillic work), and it doesn't handle
    honorifics, titles, merged initials, or postnominals. It only handles
    names. But it does handle complex names coming from a variety of places
    (e.g., Europe, Middle East, Africa, East Asia, Pasifika, Americas). By
    default, it doesn't correctly identify unhyphenated multi-name family
    names (like Spanish names, unless "y" is present). It handles some mixed
    case names such as McAdam, MacArthur, FitzSimmons, DeVito, VanZandt,
    etc., but there will be false negatives (and arguably false positives)
    which can be corrected with case exceptions. Over time, you will build
    up a set of case exceptions and split exceptions that meets the needs of
    your dataset.

EXPORT TAGS
    This module doesn't export any function names by default. The following
    export tags are available for the "use" statement:

    ":all" - All functions
    ":case" - *namecase*, *gnamecase*, *fnamecase*, *namecase_exception*,
    *normalize*
    ":split" - *namesplit*, *nameparts*, *namesplit_exception*, *normalize*

FUNCTIONS
    $name = namecase([$name[, $part[, $given_names]]])
        Returns the supplied name with the capitalization fixed. See the
        EXAMPLES section below to see exactly what this means. This can be
        called in several ways:

        For a full name (implicitly):

         $name = namecase("JOHN PETER SMITH");
         $name = namecase("SMITH, JOHN PETER");

         $name = namecase; # Same as namecase($_)

        If not supplied, the $name argument defaults to $_.

        For a full name (explicitly):

         $name = namecase("JOHN PETER SMITH", 'full');
         $name = namecase("SMITH, JOHN PETER", 'full');

        Note that the full name can be supplied in the (ambiguous) natural
        order, with the given name(s) followed by the family name, or
        unambiguously, with the family name followed by a comma followed by
        the given name(s).

        For a given name or names by itself (same as *gnamecase()*, see
        below):

         $name = namecase("JOHN PETER", 'given');

        For a family name by itself (same as *fnamecase()*, see below):

         $name = namecase("SMITH", 'family');

        For a family name by itself when you have (or might one day have)
        any case exceptions intended to only affect a single individual
        (same as *fnamecase()*, see below):

         $name = namecase("SMITH", 'family', "JOHN PETER");

    $given_names = gnamecase([$given_names])
        Returns the supplied given name(s) with the capitalization fixed.
        Same as: "namecase($given_names, 'given')". Given names aren't
        capitalized in exactly the same way as family names.

        If not supplied, the $given_names argument defaults to $_.

    $family_name = fnamecase([$family_name[, $given_names]])
        Returns the supplied family name with the capitalization fixed. Same
        as: "namecase($family_name, 'family'[, $given_names])".

        If not supplied, the $family_name argument defaults to $_.

        The $given_names argument is technically optional, but it should be
        supplied, just in case you ever need case exceptions that only apply
        to an individual. This enables people with the same family name to
        have their use of that family name capitalized the way they want it
        to be. Once you have a need for such individual case exceptions, the
        $given_names argument will become necessary everywhere, so it's best
        to supply it from the start.

    namecase_exception($bespoke_capitalized_name, ...)
        Add one or more case exceptions. Whenever the above case functions
        subsequently capitalize the supplied name, the supplied
        capitalization will be returned, rather than the default behaviour.

        There are two kinds of case exception. Some apply to everyone that
        shares a family name, and some apply to an individual.

        Family-wide exceptions contain the family name, capitalized
        correctly:

         namecase_exception("DiBona");

        Individual case exceptions must be supplied as unambiguous full
        names in the form: *"Family_name, Given_names"*, capitalized as
        specified by the named person:

         namecase_exception("DiBona, John");

        Returns 1 if the exception was successfully added. Returns 0
        otherwise. The only reason for a failure is if the supplied
        exception is undefined or empty.

    $full_name = namesplit([$full_name])
        Returns the supplied full name converted to the unambiguous form:
        *"Family_name, Given_names"* with the capitalization fixed.

        If not supplied, the $full_name argument defaults to $_.

        The $full_name argument is expected to be in either form:
        *"Given_names Family_name"* or *"Family_name, Given_names"*. Note
        that a string is returned with the family name followed by a comma
        and space followed by the given name(s).

        Complex grammatical aristocratic/topographic/patronymic family names
        in Latin script are identified. See the EXAMPLES section below. But
        unhyphenated multi-name family names are not correctly identified by
        default. That requires split exceptions (see below). Spanish and
        Catalan multi-name family names are correctly identified when the
        two names are joined with "y" or "i", but when the joining word is
        not present, a split exception is needed.

        With Chinese, Japanese, and Korean names, the family name appears
        first when written in their own scripts/characters. When romanized,
        Chinese and Korean family names might appear first or last. The same
        is true for Vietnamese names.

        This module recognizes the 400 or so most common Chinese family
        names (97% of the population) in Chinese characters and in one
        romanized spelling, and additionally, the 100 most common Chinese
        family names (85% of the population) in pinyin and various other
        romanized spellings, as used in several countries. It also
        recognizes the 190 most common Korean family names (98% of the
        population) in Hangul, Hanja, and romanized form. It also recognizes
        209 Vietnamese family names (100% of the population, apparently).
        There are too
        many Japanese family names (over 300,000) to maintain a list of
        them, so this module delegates to *Lingua::JA::Name::Splitter* which
        employs a statistical method to identify Japanese family names
        written in Kanji and Kana. I don't know what proportion of Japanese
        family names it identifies.

    ($family_name, $given_names) = nameparts([$full_name])
        Returns the supplied full name converted to a two-element array
        containing the family name and the given name(s), with the
        capitalization fixed.

        If not supplied, the $full_name argument defaults to $_.

        The $full_name argument is expected to be in either form:
        *"Given_names Family_name"* or *"Family_name, Given_names"*.

        This function converts the corresponding return value of
        *namesplit()* into a two-item array. See *namesplit()* above for
        more details. If the name contains a single "word", then it isn't
        splittable, and so a one-element array is returned. If the name is
        the empty string or undefined, then a zero-element array is
        returned.

        Chinese, Japanese, and Korean names in their own scripts/characters
        contain multiple words even though there are no spaces between
        them. If such a full name is supplied, this function will still
        return a two-element array.
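
        Because the returned list can have two, one, or zero elements, the
        caller probably wants to check its size before using it. The
        following is a minimal sketch only: *label_parts()* is a
        hypothetical helper, not part of this module, and literal lists
        stand in for *nameparts()* results so that the sketch runs on its
        own.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::NameUtils): label whatever
# list nameparts() returned, according to how many elements it has.
sub label_parts {
    my @parts = @_;
    return { family => $parts[0], given => $parts[1] } if @parts == 2;
    return { single => $parts[0] }                     if @parts == 1;
    return {};   # empty or undefined input: nothing to label
}

# With Lingua::NameUtils loaded, the call would look like:
#   my $record = label_parts(nameparts($input));
# Literal lists stand in for nameparts() results here.
my $two  = label_parts('Smith', 'John Peter');
my $one  = label_parts("Oso'ese");
my $none = label_parts();
```

        Checking the list size first avoids reading an undefined second
        element when the input was a single word.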

    namesplit_exception($full_name_in_comma_form, ...)
        Add one or more split exceptions. The exceptions must be supplied as
        full names in the unambiguous comma-separated form with the family
        name followed by a comma and space followed by the given name(s).

        This is needed to support unhyphenated multi-name family names that
        aren't automatically identified, such as *"Ah Mu, Corie"*, and even
        complex given names that would be misrecognized, such as
        *"de Sousa, Fatima de Gois"*.

        This is also needed to correct the situation when this module
        misidentifies the type of name, and splits it incorrectly. For
        example, a Japanese name with a family name consisting of two
        characters might be misidentified as a Chinese name with a family
        name consisting of one character.

        Returns 1 if the exception was successfully added. Returns 0
        otherwise. The only reason for a failure is if the supplied
        exception does not contain a comma.

    $name = nametrim([$name])
        Returns the supplied name (given, family, or full name), with any
        leading and trailing spaces removed, any run of multiple spaces
        replaced with a single space, any space before a comma-like
        character or hyphen-like character removed, and with a space added
        after any comma-like character, if one is not already present there.

        If not supplied, the $name argument defaults to $_.
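
        The rules above can be approximated for plain ASCII input with a
        few substitutions. This is only a sketch to illustrate the
        description: *trim_sketch()* is a hypothetical name, it handles
        only the ASCII comma and hyphen-minus, and it is not the module's
        implementation.

```perl
use strict;
use warnings;

# Hypothetical ASCII-only approximation of the rules described for
# nametrim(). The real function also handles comma-like and hyphen-like
# characters outside ASCII, which this sketch ignores.
sub trim_sketch {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;     # remove leading and trailing spaces
    $name =~ s/\s+/ /g;          # squeeze runs of spaces to one space
    $name =~ s/ (?=[,-])//g;     # remove a space before a comma or hyphen
    $name =~ s/,(?=\S)/, /g;     # ensure a space follows each comma
    return $name;
}

my $tidy = trim_sketch("  SMITH ,JOHN   PETER ");
```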

    normalize($func)
        Normalize this module's internal data using the supplied Unicode
        normalization function reference so as to match your application's
        choice of normalization. A likely choice would be
        *Unicode::Normalize::NFD*.

        This is needed if the application's choice of Unicode normalization
        differs from whatever was used for the module's internal data in the
        module source code (i.e., NFC). A difference in normalization can
        lead to false negatives and incorrect results when matching names
        against internal data.
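
        The kind of mismatch described above can be shown with
        *Unicode::Normalize* alone. In this sketch, the %known hash is
        only a stand-in for NFC-keyed lookup data (an assumption for
        illustration, not the module's actual data structure), and the
        re-keying step plays the role that *normalize(\&NFD)* plays for
        the module's internal data.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# The same name built two ways: they render identically but differ as strings.
my $composed   = "Jos\x{E9}";     # precomposed e-acute (NFC form)
my $decomposed = "Jose\x{301}";   # "e" plus combining acute accent (NFD form)

# Stand-in lookup table keyed by the NFC spelling.
my %known = ($composed => 1);

# An NFD-normalized input misses the NFC key.
my $raw_hit = exists $known{$decomposed} ? 1 : 0;

# Re-keying the table with the application's normalization fixes the lookup.
my %rekeyed = map { ( NFD($_) => $known{$_} ) } keys %known;
my $nfd_hit = exists $rekeyed{ NFD($decomposed) } ? 1 : 0;

print "raw lookup: $raw_hit, after re-keying: $nfd_hit\n";
```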

    kc($name) [Internal]
        This function is internal to the module and is never exported. It
        returns hash keys created from names for looking up internal data.
        It assumes that the supplied name is defined. It assumes that
        *nametrim()* has already processed the supplied name. It's like
        *fc()* (or *lc()* on *perl v5.14*) except that it also replaces
        non-ASCII apostrophe-like characters with the ASCII apostrophe
        character, and it replaces non-ASCII hyphen-like characters with the
        ASCII hyphen-minus character.

        It is documented here so as to satisfy *Pod::Coverage* which thinks
        this should be here. It would only be useful externally if the
        application had a hash keyed by people's names.
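
        For an application that does keep such a hash, a key function in
        the same spirit might look like the following. This is a sketch
        under assumptions: *name_key()* is a hypothetical name, and the
        particular apostrophe-like and hyphen-like characters listed are
        illustrative guesses rather than the set this module uses.

```perl
use strict;
use warnings;
use feature 'fc';   # needs perl v5.16 or later; use lc() on v5.14

# Hypothetical application-side key function in the spirit of kc().
# The character lists below are illustrative, not the module's own.
sub name_key {
    my ($name) = @_;
    my $key = fc($name);
    $key =~ tr/\x{2018}\x{2019}\x{02BC}/'/;   # curly quotes, modifier apostrophe
    $key =~ tr/\x{2010}\x{2011}\x{2013}/-/;   # hyphen, non-breaking hyphen, en dash
    return $key;
}

# A hash keyed by people's names, as the application might keep.
my %phone = (name_key("O'Brien, Patrick") => '555-0100');

# A curly apostrophe and different case still find the same entry.
my $found = $phone{ name_key("O\x{2019}BRIEN, PATRICK") };
```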

EXAMPLES
    These examples show the default *namecase()* output for various forms of
    names. They also show which name forms are automatically recognized by
    *namesplit()*. Note that non-ASCII letters and punctuation in these
    examples have been replaced with the closest ASCII equivalents to avoid
    problems with some implementations of **roff*. *namesplit()* also
    supports names in Chinese characters, Korean Hangul and Hanja, and
    Japanese Kanji and Kana, but they are not shown here for the same
    reason:

     John Peter Smith
     William Maitland of Lethington

     Shaun McAdam
     Fergus MacDonald
     Lachlan Macquarie
     James FitzPatrick
     Patrick O'Brian
     Kelly St Clair

     David Le Page
     Pierre La Tour
     Rochelle Li Donni
     Giovanni Lo Giudice
     Estella d'Iapico-Bien
     Bruno dall'Agnese
     Bruno dell'Agnese
     Lorenzo de' Medici
     John de Groot
     Pierre de la Pierre
     Maria del Mar
     Maria dela Mar
     Maria dels Angels
     Giaccomo della Vella
     Giovanni delle Velle
     Maria dal Santos
     Marco dalla Vella
     Lorenza degli Castelli
     Maria di Francesco
     Giuseppe Tomasi di Lampedusa
     Pierre du Page
     Jorge da Silva
     Filipe do Santo
     Abilio dos Santos
     Adriana das Costas
     Oscar San Jose
     Catalina Santa Gutierrez
     Monica Santos Bernal

     Pablo Diego Ruiz y Picasso
     Carles Puigdemont i Casamajo
     Joao Duarte da Silva dos Santos da Costa de Sousa
     Joao Duarte da Silva Santos Costa e Sousa

     Hans von Pappenhim
     Hans zu Pappenhim
     Hans von und zu Pappenhim

     Bram van Haag
     Jeroen der Haag
     Johanne ter Horst
     Sanne den Haag
     Laura van de Horst
     Eva van der Haag
     Willem van den Haag
     Mees van het Horst
     Henrik van Voorst tot Voorst 
     Willem 'sGravesande
     Gemeente van 'sHertogenbosch
     Gemeente van 'tHoen

     Sigurd av Morgenstierne
     Maja von Munthe af Morgenstierne
     Lars Jonsson til Sudreim

     James DaSilva
     Jack DuBois
     Daniel LaForge
     Sally LeFevre
     Kristine VanZandt

     Patrick O Donoghue
     Micheal O hAodha
     Saoirse Ni Fhoghlua
     Michael Mac Donnchada
     Saoirse Nic Fhoghlua
     Michael Ua Donoghue
     Aisling Bean Ui Fhoghlua
     Saoirse Bean Mhic Fhoghlua
     Saoirse Ui Fhoghlua
     Saoirse Mhic Fhoghlua

     Rhys ap Dafydd
     Maredudd ab Owain
     Myfanwy ferch Maredudd
     Myfanwy verch Maredudd

     Camilla El Ali
     Mariam Al Musawi
     Bazif el-Bayeh
     Nariman al-Nassar
     Hizb ut-Tahrir
     Aziz ibn Hab
     Charbel bin Hab
     Angela bint Aziz
     Fatima binti Aziz
     Nadia binte Aziz
     David Ben Joseph          # Incorrect when technically ambiguous
     ben Joseph, David         # Correct when technically unambiguous
     David ben Joseph v'Rachel # Correct - this is really not ambiguous
     Leah bat Moshe
     Leah bat Moshe v'Rachel ha-Rav
     Devorah Rut bat Mordecai v' Tzipporah
     Leah mibeit Moshe v'Rachel ha-Levi
     Leah mimishpachat Moshe v'Rachel ha-Kohein

     Natalie Te Whare

     Ayize ka Nolwazi

     Oso'ese
     Ya'akov
     Y'honatan
     Sh'mu'el
     Onosa'i
     Tausa'afia
     Ka'ana'ana
     S'thembiso

LIMITATIONS
    It's impossible to actually do what this module attempts to do in a way
    that works correctly for everybody by default. There are too many people
    who want their names cased incorrectly (e.g., Da Vinci rather than
    da Vinci), and too many unhyphenated multi-name family names, and so
    many languages. This module handles complex grammatical
    aristocratic/topographic/patronymic (romanized) family names from
    various languages (e.g., French, Italian, Spanish, Catalan, Portuguese,
    English, Irish, Welsh, Scottish, German, Dutch, Swedish, Norwegian,
    Danish, Finnish, Zulu, Arabic, Hebrew), but there are many more
    languages that it doesn't know about. So, in order to keep all of your
    users happy, you will almost certainly need to build up your own list of
    case and split exceptions in a file or database, and have your
    application load them before processing any names. But if two people
    with exactly the same full name both insist on having their name
    capitalized differently to each other, that's not supported.
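
    One way to keep exceptions in a file is sketched below. The file
    format (the word "case" or "split", a tab, then the name) and the
    *read_exceptions()* helper are assumptions made for illustration and
    are not defined by this module. An in-memory string stands in for the
    file, and the calls into the module are shown as comments so that the
    sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical file format: one exception per line, "case" or "split",
# then a tab, then the name. Blank lines and "#" comments are skipped.
sub read_exceptions {
    my ($fh) = @_;
    my %found = (case => [], split => []);
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#.*)?$/;
        if ($line =~ /^(case|split)\t(.+?)\s*$/) {
            push @{ $found{$1} }, $2;
        }
        else {
            warn "Skipping unrecognized exception line $.: $line\n";
        }
    }
    return \%found;
}

# In-memory stand-in for the exceptions file, using names from this document.
my $text = join "\n",
    "# family-wide and individual case exceptions",
    "case\tFitzell",
    "case\tMarrier D'Unienville, Jean",
    "",
    "split\tBryant Smith, Denise",
    "";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $exceptions = read_exceptions($fh);
close $fh;

# With Lingua::NameUtils loaded, register them before processing any names:
#   namecase_exception(@{ $exceptions->{case} });
#   namesplit_exception(@{ $exceptions->{split} });
```

    Keeping the two kinds of exception in separate lists matters because
    an individual case exception and a split exception are both written
    in the comma form, so the name alone cannot say which one is meant.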

    Different languages can have different case conventions for the same
    "word". For example, a Greek family name can start with *el*, but a
    Spanish family name can start with *El*. This module favours the most
    likely case (i.e., Spanish in this example). For other cases, this can
    be corrected with case exceptions.

    Similarly, a Hebrew patronymic name can start with *ben*, but an Anglo
    given name can be *Ben*. That's fine if *gnamecase()* or *fnamecase()*
    are used, supplied with just the given name(s) or the family name,
    respectively, but *namecase()* supplied with an ambiguous full name will
    favour the Anglo interpretation. That is, unless the name contains other
    elements that make it obviously Hebrew, such as a matronymic component
    (e.g., v'Rachel), or a suffix such as ha-Rav. That will cause a
    technically ambiguous *ben* to be correctly identified as a patronymic
    prefix.

    Similarly, when it comes to splitting/identifying the given names and
    the family name within a full name with *namesplit()* or *nameparts()*,
    the word *ben* is not interpreted as the start of a patronymic name (in
    the absence of other clues as indicated above), because *Ben* is more
    likely to be an Anglo middle name (although *bat* is always interpreted
    as the start of a patronymic name). Luckily, Hebrew names aren't used
    much outside of religious contexts, so this hopefully won't be much of a
    problem for this module. If it is, it can be corrected with split
    exceptions (or with more detailed Hebrew names).

    Romanized Chinese, Korean and Vietnamese family names can appear at the
    start or the end of a full name. This module detects them in either
    format. But there can be false positives when a given name looks the
    same as a romanized CKV family name. For example, *namesplit()* works
    better for Korean names where the family name appears at the start
    rather than at the end, because some Korean given names look like a
    family name. Other odd cases might arise due to not knowing which
    language a romanized name is from. But split exceptions should help when
    these cases are noticed.

CAVEAT
    Unicode strings are complicated. Some graphemes can occur in multiple
    ways. Any case and split exceptions are looked up via a hash key match.
    To increase the chance of matches succeeding when they should, you
    should probably normalize strings on input to your application using
    something like *Unicode::Normalize::NFD* (or maybe even
    *Unicode::Stringprep*) before passing names to this module which assumes
    that any necessary preparation has already been done. If necessary, you
    can normalize this module's internal data (with *normalize()*) to match
    your application's choice of normalization.

    Note: This module does also work with strings in non-utf8 source code.
    It does not require utf8 source code. But it does require *perl v5.14*
    or later.

BUGS
    The *nameparts()* function probably should have been called
    *namesplit()*, because it returns an array, and the *namesplit()*
    function probably should have been called something else, because it
    returns a string. But they are the names I'm used to, and I couldn't
    think of anything better, and now it's too late to change it.

    Space characters are not preserved. Spaces at the start or end of a name
    are removed, as are spaces before commas, and before and after hyphens.
    There will always be a space after a comma. Any non-ASCII spaces are
    replaced with ASCII space. Let me know if that's a problem. It can
    probably be fixed, but I think it's a feature. If it's any consolation,
    non-ASCII apostrophe-like characters and hyphen-like characters are
    preserved. But if there is a case exception involving any
    apostrophe-like or hyphen-like characters, then they too are replaced by
    the actual character specified in the exception.

HISTORY
    A (less comprehensive) version of this module (in another language) has
    been in use for over fifteen years at a small company with a dataset of
    about fifty thousand names. With that dataset, six generic case
    exceptions were needed, two individual case exceptions, and about a
    thousand split exceptions.

    It enabled the accurate identification of names in spreadsheets so as to
    check against ID number columns, and it made reports containing people's
    names much prettier than they would otherwise have been.

SEE ALSO
    Lingua::EN::TitleParse, Lingua::EN::NameCase, Lingua::EN::NameParse,
    Lingua::JA::Name::Splitter, String::ProperCase::Surname,
    Unicode::Normalize, Unicode::Stringprep.

AUTHOR
    20230630 raf <raf@raf.org>

COPYRIGHT AND LICENSE
    Copyright (C) 2023 raf <raf@raf.org>

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.