Some statistics about the structure of Chinese Characters (汉字)

October 08, 2010, 03:31 PM posted in General Discussion

I now have some statistics to illustrate some facts about written Chinese. For the past year or so, I have been working on a Web site that contains structural data about characters (see if you are curious). Anyway, I generated some statistics about character structure based on the characters in the old HSK vocabulary list (a larger set than for the new HSK, and a well-defined sub-universe of Chinese characters).

The first column in the table below is the number of times a character can be broken down into simpler component characters or radicals.  For example, 好 -> 女 + 子.  子 could be broken down into 了 and 一, giving 好 a count of 2. The 11 characters that can't be broken down are characters like 一 and 乙, which are already pretty simple.

The second column is the number of times that a character participates in the formation of a more complex character. In my example above, 女 and 子 would each get a count for their participation in 好. Interestingly, while some characters are active joiners, about 80% are stay-at-homes that don't participate in character formation at all, at least not within the sub-universe of characters I chose to work with, and not until someone needs to invent a new character.
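For anyone who wants to reproduce counts like these, here is a minimal sketch in Python. It is not the code behind the site; the tiny DECOMP table and the function names are made up for illustration, and it assumes that only direct components earn a participation count (so 了 and 一 get no credit for 好).

```python
from collections import Counter

# Hypothetical toy decomposition table: character -> its direct components.
# A real table would cover the whole HSK list.
DECOMP = {
    "好": ["女", "子"],
    "子": ["了", "一"],
    "妈": ["女", "马"],
    "李": ["木", "子"],
}

def levels(ch, decomp=DECOMP):
    """Number of decomposition levels below ch (0 if it cannot be broken down)."""
    parts = decomp.get(ch)
    if not parts:
        return 0
    return 1 + max(levels(p, decomp) for p in parts)

def participation(decomp=DECOMP):
    """For each component, count how many characters it directly helps form."""
    counts = Counter()
    for parts in decomp.values():
        counts.update(set(parts))  # count each component once per character
    return counts

# Every character mentioned anywhere in the table.
chars = set(DECOMP) | {p for parts in DECOMP.values() for p in parts}
level_counts = {ch: levels(ch) for ch in chars}
joins = participation()

print(level_counts["好"])  # 2, as in the example above
print(joins["女"])         # 2 (好 and 妈 in this toy table)
print(joins["好"])         # 0, a stay-at-home
```

Tallying level_counts and joins over a full character list would give the two columns of the table.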

Count          Characters with this many      Characters used this many
               decomposition levels           times as a component
0              11                             2219
1              533                            218
2              845                            94
3              951                            69
4              432                            60
5              66                             38
6              4                              23
7              0                              15
8              0                              13
9              0                              15
10 or more     0                              78
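As a quick check on the table, the short Python snippet below totals both columns and derives the share of stay-at-homes and the average number of decomposition levels. The numbers are copied from the table; the variable names are just for illustration.

```python
# Rows of the table: (count, characters with that many decomposition levels,
# characters used that many times as a component). The last row is "10 or more".
ROWS = [
    (0, 11, 2219),
    (1, 533, 218),
    (2, 845, 94),
    (3, 951, 69),
    (4, 432, 60),
    (5, 66, 38),
    (6, 4, 23),
    (7, 0, 15),
    (8, 0, 13),
    (9, 0, 15),
    (10, 0, 78),
]

total_by_levels = sum(levels for _, levels, _ in ROWS)
total_by_joins = sum(joins for _, _, joins in ROWS)

# Share of characters that never appear as a component of another character.
stay_at_home_share = ROWS[0][2] / total_by_joins

# Mean number of decomposition levels. This is exact, because no character
# falls in the open-ended "10 or more" row of the first column.
mean_levels = sum(n * levels for n, levels, _ in ROWS) / total_by_levels

print(total_by_levels, total_by_joins)
print(f"{stay_at_home_share:.1%}")
print(f"{mean_levels:.2f}")
```

Both columns sum to 2842 characters. The stay-at-home share comes out at about 78%, and the mean is about 2.5 levels of decomposition.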

While these statistics are only for a limited subset of all Chinese characters, I am confident you would see a similar pattern with any other reasonably sized set of Chinese characters. Also, another person might decompose characters differently than I have. However, it is only the last level of decomposition into "simplest" elements that is more of an art than a science. Most decompositions are from one commonly used character into a couple of other commonly used characters, like my example with 好. There seem to be two forces operating. One is combining a relatively small number of fixed elements to make a large number of characters. The other is a limit on the acceptable complexity of a character. The typical character has gone through two or three levels of compounding, and is a leaf node in the formation process.

Well, I don't actually know how characters were formed. This is just my hypothesis from analyzing their appearance. Perhaps it is a useful observation.

October 08, 2010, 05:08 PM

It's difficult for me.

Profile picture
mark, I was hoping someone might make a less inscrutable comment.

October 10, 2010, 08:42 AM

I can't say that I know all that much about how characters were formed either, but it certainly makes sense to me. If you look at the distribution of characters by stroke count, it's more or less a bell curve peaking around 10 (if I remember correctly; I can't seem to find the chart I saw). There must be a tipping point where adding another element to clarify the difference between two characters reduces the overall clarity of the character by jamming too many strokes into a single frame. I suspect those are the rocks upon which the wave of character creation breaks.

Profile picture

That makes sense. Nicely put too, hehe

October 10, 2010, 03:29 PM

I think John is right. The distribution is centered about a reasonable level of complexity. Some character parts also seem to be more popular than others. I don't think they had rules; the "rules", if there are any, came later as an attempt to categorize things. It is what it is. They may have followed prior examples, however, since character formation is a dynamic thing. I doubt any radical list was decided upon; it's just the way things ended up. It's more art than science, but there are patterns.

October 10, 2010, 10:13 PM

If I recall correctly from one of our lessons, 秦始皇 was the first one to impose a standardization and simplification of characters. That would probably be the first time there were rules. My current impression is that there is definitely some order to characters; you couldn't just draw a picture of something and have anyone think it was a candidate for being a character, for example.

We seem to be in agreement that there is a constraint on complexity. However, I think there are other constraints as well. There seems to be a strong tendency towards re-use of existing patterns, and the atomic particles used in forming the molecules are selected from a small number of varieties.

I think these observations together might be useful in character recognition, human, or otherwise.

October 10, 2010, 10:27 PM

some characters are active joiners, about 80% are stay-at-homes

I think this is where the utility of analysis can come in. Identifying those active joiners would be a worthwhile thing to do, because if you could then learn them you'd get a head start in learning new characters. I guess this is mostly covered by those radical frequency lists. The question for me is whether there are more complex characters that are active joiners that would not be included in radical frequency lists. I guess they're likely to turn up in the basic/frequent character lists, though would that be an HSK 1 or 2 list? Any insights into these frequent joiners you've noted, mark?

Profile picture

The top few "joiners" (characters that participate in the formation of other characters), by my calculations, would be:

木, 口, 一, 土, 月, 日, 女, 心, 贝, 又, 禾, 止, 人, 火, 力, 目, 虫, 石, 王, 十, 尸, 田, 广, 厶, 立, 米, 八, 小, 车, 巾, 马, 大, 寸, 页, 二, 工, 山, 隹, 厂, 白, 几, 门, 干, 斤, 丁, 戈, 方, 欠, 穴, 子, 夕, 儿, 水, 羊, 雨, 皿, 者, 手, 分, 少, 酉, 中, 刀, 耳, 犬, 匕, 共, 殳, 羽, 勿, 走, 且, 弓, 肖, 圭, 皮, 由, 户, 彐, 令, 舟, 衣, 舌...

Most are very frequently used by themselves. A few seem to be historical remnants of characters that were perhaps frequently used in the past. Most aren't very complicated. When already complex characters participate in character formation, they usually get a radical tacked on to the left or on top, but that is usually a onesie-twosie occurrence.
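On the earlier question of whether any more complex characters are active joiners (者 and 分 in the list above are examples), one way to pull them out of a decomposition table is sketched below in Python. The small DECOMP table and the function name are hypothetical, and only direct components are counted.

```python
from collections import Counter

# Hypothetical sample table: character -> its direct components.
DECOMP = {
    "粉": ["米", "分"],
    "盼": ["目", "分"],
    "暑": ["日", "者"],
    "分": ["八", "刀"],
    "者": ["耂", "日"],
}

def compound_joiners(decomp):
    """Return (character, count) pairs for joiners that are themselves compound.

    A joiner is any character used as a direct component of another one.
    It is "compound" if the table also lists a decomposition for it.
    Pairs are ordered from most-used to least-used.
    """
    counts = Counter()
    for parts in decomp.values():
        counts.update(set(parts))
    return [(ch, n) for ch, n in counts.most_common() if ch in decomp]

print(compound_joiners(DECOMP))  # [('分', 2), ('者', 1)]
```

Run against a full table, the same filter would list the compound characters worth learning early, which a plain radical frequency list might leave out.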

October 18, 2010, 07:00 AM


October 19, 2010, 01:52 PM

Awesome, I was just going to ask for the list! Great job, Mark!

Is there a way to create a flashcard deck from these? Or can someone share theirs?

May 20, 2012, 08:31 AM

No, there's not.