Word difficulty statistics

GrandPoohBlah   September 24th, 2011 10:09p.m.

I was just wondering a couple things.

Is it possible to get a comprehensive list of words/characters at each difficulty level (i.e. easy, easier, medium, harder, hard) so I can see how complete my vocab is at each level, or even just so I can study from a complete list of easy words?

Is it possible to get distribution statistics for lists and My Words? I am curious how many words that I've learned or that I've added to a list are easy/medium/hard.

YueMeigui   September 25th, 2011 3:38a.m.

I get the baidu 001 baidu 002 usernames for the spam but I'm wondering if Alex Mathers is a real person whose account was hacked.

scott   September 25th, 2011 11:15a.m.

@GrandPoohBlah: We don't have a way to get that info that's built into the site but we can generate a report for you if you'd like.

So to make sure I understand, for the first one you simply want to fetch all words you're studying and organize them by toughness.

And for the second one, you'd like a table, columns being toughness and rows being lists, and numbers in cells showing how many words that you've learned for each list for each toughness?
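
A minimal sketch of the table described above, in Python. The sample data, the tuple layout, and the function names `toughness_table` and `format_table` are all hypothetical; this is not how Skritter generates its reports, and the toughness labels are simply the ones listed earlier in the thread.

```python
from collections import Counter, defaultdict

# Toughness labels as listed earlier in the thread (assumed, not official names).
TOUGHNESS_LEVELS = ["easy", "easier", "medium", "harder", "hard"]

# Hypothetical sample: (list name, word, toughness) for words already learned.
LEARNED = [
    ("HSK 1", "你好", "easy"),
    ("HSK 1", "谢谢", "easy"),
    ("HSK 1", "医院", "medium"),
    ("HSK 6", "哦", "easy"),
    ("HSK 6", "弊端", "hard"),
]

def toughness_table(rows):
    """Count learned words per (list, toughness): rows are lists, columns are levels."""
    counts = defaultdict(Counter)
    for list_name, _word, toughness in rows:
        counts[list_name][toughness] += 1
    # Fill in zeros so every row has every column.
    return {
        name: {level: per_list[level] for level in TOUGHNESS_LEVELS}
        for name, per_list in counts.items()
    }

def format_table(table):
    """Render the table as plain text, one list per row."""
    header = "list".ljust(8) + "".join(level.rjust(8) for level in TOUGHNESS_LEVELS)
    lines = [header]
    for name in sorted(table):
        cells = "".join(str(table[name][level]).rjust(8) for level in TOUGHNESS_LEVELS)
        lines.append(name.ljust(8) + cells)
    return "\n".join(lines)

if __name__ == "__main__":
    print(format_table(toughness_table(LEARNED)))
```

Running the same count over every published list, rather than only learned words, would give the per-list difficulty distribution asked about in the original post.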

@YueMeigui: All four of the (now deleted) spam messages were submitted by people who were not logged in, so no worries there.

GrandPoohBlah   September 25th, 2011 12:17p.m.

Naw, don't worry about it. I was hoping there would be some feature on the website so that I could observe how these statistics change as I study more words and more lists or so that I could check the relative difficulty of other people's lists.

As for the first request, I was thinking more along the lines of automatically generated lists, one for each difficulty level, that contain all words of that difficulty level that are in Skritter's database, mostly just to sate my curiosity.

edit: Since we're on the topic, if you can generate a spreadsheet of this kind of data, why can't you write a script to automatically generate statistics for a list?

swimming   September 25th, 2011 1:10p.m.

How are the difficulty levels determined?

nick   September 25th, 2011 2:46p.m.

Ooo, relative difficulty of others' lists: now that's an interesting idea. I don't know how feasible it is to have those calculations run and keep the list up to date, but it would be an interesting sort. (It's not hard to do it once, but to keep it adjusted when list edits are made is harder.)

The toughness is described in a little detail here:
http://blog.skritter.com/2011/01/toughness-indicators-in-word-popup.html

Now that I'm seeing how little effect the "toughness" has on actual difficulty, I'm thinking of renaming it to something like "usefulness", "importance", or "frequency".

GrandPoohBlah, they don't have all the words in them, but our HSK lists are sorted by toughness (within each list, as each list's contents are already set by the testmakers). So you can get some sense of how many of those words you've learned as you go through HSK sections.

InkCube   September 25th, 2011 5:50p.m.

I'd like to use the opportunity to make a related request that I've been thinking about lately.

I think it would be really neat if in the word pop-up it would show the HSK level (if applicable) in the empty space next to the toughness level.

It would come in especially handy if you click on an unknown word in an example sentence or check out the words that contain a specific character.

In my mind it would work like the info of trad/simpl versions and just be there if the word is in an HSK list.

GrandPoohBlah   September 25th, 2011 5:51p.m.

@inkubus: I like that idea too. MDBG and other online Chinese dictionaries have already implemented this, so I don't imagine it would be too hard to integrate into Skritter.

nick   September 25th, 2011 8:43p.m.

What do you see the HSK level providing that the toughness statistic doesn't provide?

GrandPoohBlah   September 25th, 2011 11:26p.m.

Well, for starters, the HSK is a standardized test used for assessing proficiency in Chinese. The toughness statistic, though not completely arbitrary, is not linked to any standardized, universal metric. However, since not all words in the database are also on the HSK, it's useful to have both.

石磊   September 26th, 2011 2:12a.m.

Unfortunately the new HSK levels are not an accurate predictor of a word's frequency of usage.

For example the first word in the new HSK 6 list:
"哦 ò, ó, é: oh (indicates understanding), oh (indicates doubt)"
is one of the 100 most frequent words used in Chinese film and TV subtitles, see academic research at:
http://expsy.ugent.be/subtlex-ch/

(MandarinBoy has published the top 1-5k "Chinese movie word frequency list"s on Skritter.)

Antimacassar   September 26th, 2011 3:09a.m.

It would also be useful to me. Because when I add an individual character (but know no words that contain it) I like to add a word at the same time, and I would prefer to add words that are on the HSK list than not (of course I can do this by looking in a list, but would be quicker if I could see on Skritter)

InkCube   September 26th, 2011 6:20a.m.

Well, I figure the toughness indicator and the HSK level are both not perfect tools but they could kind of give you a second opinion.

Antimacassar   September 27th, 2011 5:28a.m.

@石磊: re: "the 1-5k "Chinese movie word frequency list"s". They are indeed useful, but one fact makes me wonder exactly how useful they are. There are a large number of foreign names (e.g. 彼特, 查理, etc.) and zero Chinese names (at least as far as I can tell). This makes me think that the statistics the lists are based on come from translations of foreign movies into Chinese. Not that that would present a problem; I just wonder if the list would've been any different if it were based exclusively on movies that were originally in Chinese (I assume it would be; to what degree is of course harder to tell).

Also, remember that statistics for movies are (I also assume) going to be quite different from those for newspapers, books, and the spoken language (sadly no statistics for that). OK, maybe you could argue that it's closer to conversational language, but that could be a false assumption (how many of us talk like we are in a movie?). It's possible that you won't end up being better at conversation but better at watching movies with the (Chinese) subtitles on.

swimming   September 27th, 2011 5:58a.m.

I had exactly the same impression while looking at the "Chinese movie word frequency lists". The list should probably be renamed to "Chinese subtitles word frequency lists".

jww1066   September 27th, 2011 7:25a.m.

@Antimacassar we already have a number of lists based on other frequency studies, and most of those are based on newspapers and books. So the whole idea with the subtitle frequency was to look at something that would at least be closer to spoken Chinese than the dry vocabulary you get from written sources. I'm not sure whether the movies used were originally Chinese, though.

James

Antimacassar   September 27th, 2011 9:46a.m.

@jww1066 totally agree that it's better than what we had before and useful to study.

However, I just wonder exactly how much closer it is to spoken Chinese than the other lists. I mean, it's not clear to me to what extent movies are an accurate reflection of the spoken language (not to mention the point that, as far as I can tell, they are based on translations of foreign (from a Chinese POV) movies).

Mandarinboy   September 27th, 2011 10:55a.m.

The list mentioned is based on a study conducted by the University of Ghent, Belgium. I did a similar study myself before I found their papers, and then cross-checked the two to find that they are very close. In both cases we used subtitles from normal daily shows and movies (both Chinese and foreign). In Ghent they did an excellent job of validating the word bank with native Chinese speakers. It does in fact reflect daily usage very well. Naturally, there are many words that are more frequent in newspapers and others that are more frequent in, e.g., historical Chinese dramas. If you are interested, I suggest reading the papers they wrote about their study and how it reflects daily usage. Very interesting. My main point is that there is no list that can reflect all Chinese usage. This list is close to the daily usage of standard conversation but will lack some of the words that are more frequent in, e.g., political newspapers. To make up for that, I also scanned some 40,000 newspaper articles (I am still scanning more). In those lists there are more words about the Internet, politics, countries, politicians, etc. Still, the majority of the actual words are very much the same. For fun I also scanned articles that my wife reads: fashion, local Hangzhou news, education, etc. It is not exactly the same as the general list, but all the standard words are. So, there is no THE list, just many A lists.

Link to Ghent study: http://expsy.ugent.be/subtlex-ch/

nick   September 27th, 2011 5:42p.m.

HSK levels seem to be a poor metric to me, and I hardly want to have two metrics for the same thing displayed. The toughness metric we use already takes into account to some degree which HSK list a word is in, since HSK lists are some of the textbooks consulted in the calculation of toughness.

I think it would be good to redo the toughness calculations to include Mandarinboy's excellent frequency lists, but it's not something high on the priority list.

Antimacassar   September 27th, 2011 7:06p.m.

@Mandarinboy. It's certainly a great list, and there are lots of words on it that I knew from spoken conversation that weren't on other lists, indicating its usefulness.

It's just the point about given names which strikes me as slightly incongruous. I have come across (I guess) more than 10 translations for Western given names (e.g. 比尔) and not one Chinese given name. It's as if there were a word frequency list for English that didn't contain any English names but did have Klaus and ZeDong.

The only explanation I can think of is that Chinese names are so diverse that they don't hit high on frequency ratings (does anyone know if there exists a Chinese equivalent of Michael or John for example?).

Even so, I doubt the usefulness of learning these particular words (as compared to the rest), since you're unlikely to need them.

Mandarinboy   September 28th, 2011 1:47a.m.

I am very, very sorry; that is actually a blunder on my side. Chinese names are categorised as personal names but with a blank English translation. I filtered out all entries with blank English strings so as not to mess things up with some other stuff I am playing with, totally forgetting that this would hit Chinese names. Naturally I should replace the blank with something like "Common Chinese name" or similar. Foreign names, such as 比尔 (Bill), always have an English translation, so they did not suffer from that. In my own list I do have the proper name definition, but in the Ghent list this is blanked out. There are other lists with just Chinese names, but I can also reprocess this list and include the Chinese names. I can't update a published list, though, so I might have to create a new one if it is required. I will try to cross-reference a list of Chinese names that includes famous Chinese people and their line of work as guidance. That might give a better "translation"/explanation for the names. A quick look in the current database shows many more Chinese names than foreign ones, so that should hopefully calm your worries. I think that around 95%+ of the films we processed were Chinese by origin, with the remainder mostly Asian and American top movies.
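
The filtering mistake described above can be sketched in a few lines of Python. The entry fields (`word`, `category`, `english`), the sample rows, and both function names are hypothetical, not Mandarinboy's actual script; the placeholder text is the one suggested in the post.

```python
# Hypothetical entries; field names and values are illustrative only.
ENTRIES = [
    {"word": "比尔", "category": "personal_name", "english": "Bill"},
    {"word": "志强", "category": "personal_name", "english": ""},
    {"word": "哦", "category": "interjection", "english": "oh"},
    {"word": "xx", "category": "unknown", "english": ""},
]

def drop_blank_english(entries):
    """The filter from the post: drops every entry whose English gloss is blank."""
    return [e for e in entries if e["english"].strip()]

def label_blank_names(entries, placeholder="Common Chinese name"):
    """Give blank personal names a placeholder gloss so the filter keeps them."""
    return [
        {**e, "english": placeholder}
        if e["category"] == "personal_name" and not e["english"].strip()
        else e
        for e in entries
    ]

if __name__ == "__main__":
    before = [e["word"] for e in drop_blank_english(ENTRIES)]
    after = [e["word"] for e in drop_blank_english(label_blank_names(ENTRIES))]
    print(before)  # the Chinese given name has been filtered out
    print(after)   # the Chinese given name survives with a placeholder gloss
```

Labelling the names before filtering keeps the blank-string filter useful for the entries it was meant to remove, while the Chinese given names stay in the list.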

Antimacassar   September 28th, 2011 3:48a.m.

ok, no probs. Like i say, it's still a great list and v.useful :)

nick   September 28th, 2011 11:02a.m.

Updating published lists is coming very soon!

Antimacassar   October 12th, 2011 7:52a.m.

@Mandarinboy

I just thought that I would point out that there are also a large number of U.S. cities in the frequency lists. I wasn't sure if this was, as with the names, a slight oversight or not, so I just thought I would point it out (if the list ever gets updated, it could be useful).

One more thing. The Chengyu frequency list is divided into 5 sections, but they all have seemingly random numbers of words in them. I just wondered if there was any reason for this (e.g. the first group is the most frequent, and so on).

Mandarinboy   October 12th, 2011 10:02a.m.

@Antimacassar US cities are actually that frequent. I did a specialized search in the subtitles since I too found that strange. There is actually a lot of talk about cousins/friends/relatives etc. in the US that accounts for that. The "dramas" especially tend to bring up relatives or studying in the US. I will update the lists as soon as we can do that with published lists.

As for the Chengyu list, those are just plain and simple random numbers. I did sort them in the correct order, but when I moved them to Skritter I just took "enough" words for each section. No thought there, just laziness. There are many more chengyu in the database, so I just concentrated on the very frequent ones.
