Some statistics about the structure of Chinese Characters

October 08, 2010, 03:25 PM posted in General Discussion

I now have some statistics to illustrate some facts about written Chinese. For the past year or so, I have been working on a Web site that contains structural data about characters (see if you are curious). Anyway, I generated some statistics about character structure based on the characters in the old HSK vocabulary list (a larger set than the new HSK list, and a well-defined sub-universe of Chinese characters).

The first column in the table below is the number of times a character can be broken down into simpler component characters or radicals.  For example, 好 -> 女 + 子.  子 could be broken down into 了 and 一, giving 好 a count of 2. The 11 characters that can't be broken down are characters like 一 and 乙, which are already pretty simple.

The second column is the number of times that a character participates in the formation of a more complex character. In my example above, 女 and 子 would each get a count for their participation in 好. Interestingly, while some characters are active joiners, about 78% (2,219 of 2,842) are stay-at-homes that don't participate in character formation at all, at least not in the sub-universe of characters I chose to work with, and not until someone needs to invent a new character.
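For anyone who wants to try the two counts on their own data, here is a minimal sketch in Python. It is not the code behind my site: the table `DECOMPOSITION` and the function names are made up for illustration, the first count is read as the number of levels of breakdown (the longest chain of decompositions), and only direct components earn a participation count.

```python
from collections import Counter

# Hypothetical toy data: each character maps to its direct components.
# Characters missing from the table are treated as not decomposable.
DECOMPOSITION = {
    "好": ["女", "子"],
    "子": ["了", "一"],
}


def levels(char, table=DECOMPOSITION):
    """Number of levels a character can be broken down (assumed: tree depth)."""
    parts = table.get(char)
    if not parts:
        return 0
    return 1 + max(levels(part, table) for part in parts)


def participation(table=DECOMPOSITION):
    """How many characters each component directly helps to form."""
    counts = Counter()
    for parts in table.values():
        for part in set(parts):  # count a component once per character
            counts[part] += 1
    return counts


print(levels("好"))          # 好 -> 女 + 子, then 子 -> 了 + 一
print(participation()["子"])  # 子 is a direct component of 好
```

If you would rather count every individual breakdown step instead of levels, replace `max` with `sum` in `levels`; for 好 both readings give the same answer.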

Count          Breakdowns    Participations
0                      11              2219
1                     533               218
2                     845                94
3                     951                69
4                     432                60
5                      66                38
6                       4                23
7                       0                15
8                       0                13
9                       0                15
10 or more              0                78
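As a quick sanity check on the table, the sketch below re-enters the two columns by hand and computes the totals, the share of characters that never act as a component, and the average number of breakdowns. The variable names are just for illustration.

```python
# Column 1: number of characters by breakdown count (rows 7 and up are zero).
breakdowns = {0: 11, 1: 533, 2: 845, 3: 951, 4: 432, 5: 66, 6: 4}

# Column 2: number of characters by participation count.
participations = {0: 2219, 1: 218, 2: 94, 3: 69, 4: 60, 5: 38,
                  6: 23, 7: 15, 8: 13, 9: 15, "10 or more": 78}

total = sum(breakdowns.values())
# Both columns describe the same set of characters, so they must agree.
assert total == sum(participations.values())

# Share of characters that never appear as a component of another one.
share_stay_at_home = participations[0] / total

# Average number of breakdowns per character.
mean_breakdowns = sum(k * n for k, n in breakdowns.items()) / total

print(total)                         # 2842
print(round(share_stay_at_home, 3))  # 0.781
print(round(mean_breakdowns, 2))     # 2.52
```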

While these statistics are only for a limited subset of all Chinese characters, I am confident you would see a similar pattern with any other reasonably sized set of Chinese characters. Also, another person might decompose characters differently than I have. However, it is only the last level of decomposition into "simplest" elements that is more of an art than a science. Most decompositions are from one commonly used character into a couple of other commonly used characters, like my example with 好. There seem to be two forces operating. One is combining a relatively small number of fixed elements to make a large number of characters. The other is a limit on the acceptable complexity of a character. The typical character has gone through two or three levels of compounding, and is a leaf node in the formation process.

Well, I don't actually know how characters were formed. This is just my hypothesis from analyzing their appearance. Perhaps it is a useful observation.

March 10, 2011, 06:05 AM

"好 = 女 + 子." Correct! But we wouldn't break 子 down into 了 and 一,

because 子 is already a minimal, indivisible character in Chinese.

Another example: "李 = 木 + 子", but 木 would not be broken down into 十 and 八.


Good job though!  *^o^*


Hi Joyce,

At there is an animated demonstration of where I was trying to go with this. I know my breakdown of characters is non-traditional, but I was trying to find a way for us westerners to have an easier time looking up printed characters. I do use my site myself for that, but as far as I know, it hasn't really caught on with anyone else.

Best Regards,



Great Mark~

I've checked the website~ It's a very good one for learning Chinese characters.

And FYI, I've done some of the exams, just for fun~ LOL

It's great to see someone doing this in a western way. Impressive~