ChineseNames

Chinese Name Database 1930-2008

A database of Chinese surnames and given names (1930-2008). It contains nationwide frequency statistics for 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering a Han Chinese population of about 1.2 billion (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). The package also provides a function for computing multiple features of Chinese surnames and given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence).


Author

Han-Wu-Shuang (Bruce) Bao 包寒吴霜

📬 baohws@foxmail.com

📋 psychbruce.github.io

Citation

Installation

## Method 1: Install from CRAN
install.packages("ChineseNames")

## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/ChineseNames")

User Guide

Data Source

This Chinese name database was provided by Beijing Meiming Science and Technology Company (in collaboration) and was originally obtained from the National Citizen Identity Information Center (NCIIC) of China in 2008.

It contains nationwide frequency statistics for almost all Chinese surnames and given-name characters, covering a Han Chinese population of about 1.2 billion (96.8% of the Han Chinese population born from 1930 to 2008 and still alive in 2008, i.e., the living household-registered population). It also contains subjective rating indices for given-name characters. To our knowledge, this is the most comprehensive and accurate Chinese name database to date.

Note that this database does not contain any individual-level information, so it does not expose personal data. All data are at the name level or character level. Extremely rare characters are not included.

Datasets

This package includes five datasets (each a data.frame in R). You can load them with the data() function in R. Use of these datasets must comply with the GNU GPL-3 License and the Creative Commons License CC BY-NC-SA: cite this package properly and use the data only for non-commercial purposes.
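
As a minimal sketch of how the bundled datasets can be discovered, the helper below lists the dataset names that any installed package exposes through data(). The helper name list_package_datasets is hypothetical (it is not part of ChineseNames), and the call on ChineseNames is guarded so the snippet still runs when the package is not installed.

```r
# Hypothetical helper (not part of ChineseNames): list the names of the
# datasets that an installed package makes available via data().
list_package_datasets <- function(pkg) {
  # data(package = ...) returns an object whose $results matrix has an
  # "Item" column holding the dataset names.
  unname(data(package = pkg)$results[, "Item"])
}

# Only query ChineseNames if it is installed.
if (requireNamespace("ChineseNames", quietly = TRUE)) {
  print(list_package_datasets("ChineseNames"))
}
```

Any name printed this way can then be passed to data() to load the corresponding data frame.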

Note. The “ppm” in the variable names of these datasets means “parts per million (百万分率)” (e.g., 1 ppm = a proportion of 1/10^6).
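
To make the unit concrete, here is a small sketch of converting between ppm and plain proportions. The function names ppm_to_proportion and proportion_to_ppm are illustrative assumptions and are not provided by the package.

```r
# Hypothetical helpers (not part of ChineseNames) for the ppm unit.
ppm_to_proportion <- function(ppm) ppm / 1e6   # 1 ppm = 1/10^6
proportion_to_ppm <- function(p) p * 1e6

ppm_to_proportion(1)       # 1e-06
ppm_to_proportion(2500)    # 0.0025, i.e., 0.25%
proportion_to_ppm(0.0025)  # 2500
```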

Compute Name Features

Use the compute_name_index() function. This function computes multiple indices of Chinese surnames and given names for scientific research. Input a data frame with full names (and birth years, if necessary), and it returns a new data frame with all name indices appended.

Examples:

library(ChineseNames)
?compute_name_index  # see detailed usage in help page

## Usage 1
compute_name_index(name="包寒吴霜", birth=1995)

## Usage 2
demodata = data.frame(
  name = c("包寒吴霜", "陈俊霖", "张伟", "张炜", "欧阳修", "欧阳", "易烊千玺", "张艺谋", "王的"),
  birth = c(1995, 1995, 1985, 1988, 1968, 2009, 2000, 1950, 2005))

newdata = compute_name_index(
  demodata,
  var.fullname="name",  # full name
  var.birthyear="birth")  # adjusted for birth year
View(newdata)
#        name birth name0 name1 name2 name3 NLen    SNU SNI     NU    CCU      NG     NV     NW     NC
# 1: 包寒吴霜  1995    包    寒    吴    霜    4 3.0595   2 3.6042 4.1178 -0.2187 3.3542 2.6667 3.2333
# 2:   陈俊霖  1995    陈    俊    霖          3 1.3415   3 2.4619 4.7688  0.4081 4.3125 3.6500 3.6500
# 3:     张伟  1985    张    伟                2 1.1529  26 1.6611 3.8865  0.6859 4.2500 3.5000 3.4000
# 4:     张炜  1988    张    炜                2 1.1529  26 3.0547 5.8583  0.6025 3.9375 3.4000 3.5000
# 5:   欧阳修  1968  欧阳    修                3 3.1645  15 2.9816 3.5510  0.5047 3.0625 3.5000 3.3000
# 6:     欧阳  2009    欧    阳                2 2.9694  15 2.0389 3.4574  0.5103 4.3750 4.1000 3.7000
# 7: 易烊千玺  2000    易    烊    千    玺    4 2.8689  25 3.8743 4.8944  0.4619 3.1875 3.2000 3.1667
# 8:   张艺谋  1950    张    艺    谋          3 1.1529  26 3.8808 3.6611  0.3183 3.5938 3.5500 3.3500
# 9:     王的  2005    王    的                2 1.1257  23 5.1893 1.3110 -0.5325 2.1250 2.5000 2.2000

* Instructions for the rating task of NW (name warmth) and NC (name competence), adapted from Newman et al. (2018):

According to psychological research, when people form impressions of others, they usually evaluate them in two aspects: warmth and competence.

Imagine that you are about to meet a person whose given name contains each of the following characters. Please judge how likely he/she is to have traits related to “warmth” (“competence”). If you feel uncertain, please use your intuition and make your best guess.

A Note on Multi-Character Given Names

For a Chinese given name with multiple characters, name indices are averaged across characters. In other words, name indices are computed from individual characters rather than from character combinations. The main reasons are as follows.

  1. Computing name variables at the character level is more practical in research. Character combinations are effectively countless, whereas single characters form a finite set that is easy to handle. Moreover, for name indices other than NU, this is the only feasible approach, especially in a large sample: it is impractical to ask participants to rate millions or billions of character combinations.
  2. As evidenced by our research, NU computed by averaging across characters (objective NU) was positively correlated with people’s perception of the uniqueness of their given names (subjective NU): r = 0.32, p < 0.001, N = 672.
  3. As evidenced by our research, among four measures of name uniqueness (two at the character level and two at the character-combination level), only name-character uniqueness (i.e., NU) was positively associated with cultural-level individualism.
  4. In linguistics, a name (word) is to English what a name character (single character) is to Chinese.
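
The character-level averaging described above can be sketched as follows. This is only an illustration under stated assumptions, not the package's actual implementation: the function average_char_index is hypothetical, the ratings in toy_rating are made-up values rather than figures from the database, and skipping characters without a rating is an assumption about how missing entries might be handled.

```r
# Made-up per-character ratings, for illustration only (not database values).
toy_rating <- c("寒" = 2, "吴" = 3, "霜" = 4, "伟" = 5)

# Hypothetical helper: average a character-level index across the
# characters of a given name.
average_char_index <- function(given_name, char_index) {
  # Split the given name into single characters.
  chars <- strsplit(given_name, "", fixed = TRUE)[[1]]
  # Look up each character; characters without a rating become NA
  # and are skipped here (an assumption of this sketch).
  mean(unname(char_index[chars]), na.rm = TRUE)
}

average_char_index("寒吴霜", toy_rating)  # (2 + 3 + 4) / 3 = 3
average_char_index("伟", toy_rating)      # 5
```

Because the lookup works on single characters, the table of ratings only needs one entry per character, however many combinations appear in a sample.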