
Word frequency in daily news

Mandarinboy   June 3rd, 2011 7:55a.m.

Just for fun, I have been scanning some 10,000+ newspapers to get a more up-to-date list of word frequency in daily news. There are actually rather big differences between that and what the usual frequency lists show. I updated my code to dot net during my latest trip and let a computer run day and night for a few days. Naturally there are a lot of names, such as presidents, but also a lot of computer-related words, many finance terms, etc. Really interesting. I then did another test and scanned specific topics such as politics, international, military, women, pop culture, etc. Big differences there as well. I now plan to scan 5 different newspapers for a month to get the best possible word frequency from a wider array of topics. There are enormous differences between e.g. a chat forum and a paper about politics. Once it is done and cleaned up, I will upload the top 1000 words to a public list if anyone is interested. The only problem I can see now is that I am using CEDICT as the word list to match against. It is big, but not big enough to have every word in it. Still, it is good enough, I think.

Roland   June 3rd, 2011 8:47a.m.

Hi Mandarinboy, that sounds great; I would definitely be interested. Are you also able to do the same with subtitles of some soaps etc., to get everyday language? I very often have the feeling that what I am learning from textbooks, HSK lists, etc. is not what people use in everyday language.

Mandarinboy   June 3rd, 2011 9:06a.m.

I share that feeling as well. I am planning to do that for TV shows, soap operas and music lyrics as well. I have great sources for lyrics but not so many for subtitles yet. Once I find some, or can generate them from all the shows my wife watches, I can get the word frequency easily. I plan to do separate lists for separate topics but also one for all of them combined to get a wider scope. Web forums especially seem to be much closer to daily usage of words and phrases. On my flight to China tomorrow I will update my code to be multithreaded to speed up the harvesting. I have access to very powerful servers at work, but that is no excuse to produce sloppy code.
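For readers curious what multithreaded harvesting might look like, here is a minimal sketch. It is written in Python rather than the Visual Basic for dot net that Mandarinboy mentions, and it is not his code: the names `fetch` and `harvest`, the worker count, and the UTF-8 decoding are all assumptions made for illustration (real sites may use other encodings such as GBK).

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its text (assumes UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(urls, fetch_page=fetch, workers=8):
    """Fetch all URLs concurrently; return {url: text} for pages that loaded."""
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every download up front so they run in parallel.
        futures = {url: pool.submit(fetch_page, url) for url in urls}
        for url, future in futures.items():
            try:
                pages[url] = future.result()
            except Exception:
                # Skip pages that fail to download (broad catch, sketch only).
                continue
    return pages
```

Downloading is mostly waiting on the network, so a thread pool can give a real speedup even without powerful hardware. Passing `fetch_page` as a parameter makes it possible to try the function with a fake downloader before pointing it at a live site.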

Kai Carver   June 3rd, 2011 9:22a.m.

sounds interesting!

It's one of my goals to be able to read a newspaper in Chinese the way I can read a paper in Spanish and German, i.e., pretty well for newsy articles where I am familiar with the subject.

(also, just curious, what do you mean by "I updated my code to dot net"?)

Mandarinboy   June 3rd, 2011 9:37a.m.

@Kai Carver, I wrote my old code some 10+ years ago in Visual Basic 6 just to test some ideas. Now I have tried to make it more modern, so I switched to Microsoft dot net. Dot net is basically just a framework on which you can write your code in different programming languages. I usually do that in C#, but for now I had to go with Visual Basic for dot net since I do not have the time to rewrite it all in C#. I plan to do that "in the future" :-) The code will soon be ready for release so anyone can use it on a Windows machine. With it you can scan your own newspapers for suitable articles or harvest words from your favorite site.

nick   June 3rd, 2011 1:10p.m.

Mandarinboy, would you benefit from combining some of the Skritter dictionary data with the CC-CEDICT data in order to match more words? I might be able to send you a list of just the 汉字, of which there are hundreds of thousands of entries (haven't counted recently).

Mandarinboy   June 3rd, 2011 3:14p.m.

@Nick, Thanks! I would love that. It would be very helpful. Since I have the computing power I need for this, I can then scan all sorts of sources to get useful word lists. I think that both combined and specialized word lists can be very useful. I will make them public as soon as I have enough data. In general I like to scan at least 10,000,000 words to get the best accuracy. Currently I am at some 4,000,000. If this turns out well I will go on with Japanese lists after this.
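Combining the two word lists could be sketched roughly as below, in Python. CC-CEDICT entries have the form `Traditional Simplified [pinyin] /definition/...`, with comment lines starting with `#`. The format of the Skritter list is not described in this thread, so one word per line is an assumption, and the function names and sample data are made up for illustration.

```python
def cedict_headwords(lines):
    """Yield (traditional, simplified) pairs from CC-CEDICT-formatted lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # not a well-formed entry
        yield parts[0], parts[1]


def build_vocab(cedict_lines, extra_words=()):
    """Merge CC-CEDICT headwords with an extra word list into one set."""
    vocab = set()
    for traditional, simplified in cedict_headwords(cedict_lines):
        vocab.add(traditional)
        vocab.add(simplified)
    # Assumption: the extra list holds one word per line.
    vocab.update(word.strip() for word in extra_words if word.strip())
    return vocab


sample_cedict = [
    "# CC-CEDICT sample lines",
    "中國 中国 [Zhong1 guo2] /China/",
    "經濟 经济 [jing1 ji4] /economy/economic/",
]
extra = ["经济", "微博\n"]  # hypothetical extra list
vocab = build_vocab(sample_cedict, extra)
```

Using a set means words that appear in both lists are stored once, and membership checks stay fast even with hundreds of thousands of entries.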

nicogo   June 3rd, 2011 4:54p.m.

This sounds fantastic!! I am eager to see the results and to compare them with the traditional frequency lists. Thanks for this promising work for the whole Chinese learners' community!

nick   June 3rd, 2011 5:59p.m.

Sent it to you.

Phoboss   June 3rd, 2011 7:26p.m.

This is a very crucial study for us Chinese learners, so thank you very much for your efforts, Mandarinboy *thumbsup*

alxx   June 4th, 2011 10:21p.m.

Sounds great. Very interested to give it a go.

Also interested in how you are doing it.
Do you connect to the newspaper site, go through the article links, grab the text, then filter out repeated characters?
(I usually work on embedded systems/sensor systems)

Mandarinboy   June 5th, 2011 7:52a.m.

@alxx, basically I crawl several news sites for articles, grab the inner body text of the pages, and parse each of those for words and characters. The words are matched against the CEDICT Chinese word list. I also harvest the individual characters to get another angle on the whole thing. Some editing will be done to the lists, since many headers etc. are the same on all sites. Words such as "news" and "politics" are not as frequent in the news body as the word count would suggest. I also run a few instances of the program against different sources to get the usage from forums and chats as well as different news sites such as news, international, finance, etc. I will run the resulting database through my BI tools to get some nice stats. I have some ideas for going one step further later on and also harvesting whole sentences to get a sort of frequency for those as well. On many chats, forums, blogs, etc. you have a very natural usage of the language that suits my purposes. With that data it will be easy to match useful sentences/expressions to the words I learn, to get context.
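The matching and counting step described above could look something like the following Python sketch. The post does not say how words are matched against the word list, so greedy longest-match segmentation is an assumption here, as are the function names, the 4-character maximum word length, and the tiny sample vocabulary. Characters that match no dictionary word are counted as single-character "words" in this version.

```python
from collections import Counter


def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def segment(text, vocab, max_len=4):
    """Greedy longest-match segmentation against a word list.

    At each position, take the longest dictionary word that starts there;
    fall back to a single character if nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        if not is_cjk(text[i]):
            i += 1  # skip punctuation, Latin letters, digits, whitespace
            continue
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocab:
                words.append(candidate)
                i += size
                break
    return words


def count_frequencies(articles, vocab):
    """Return (word_counts, char_counts) over an iterable of article texts."""
    word_counts = Counter()
    char_counts = Counter()
    for body in articles:
        word_counts.update(segment(body, vocab))
        char_counts.update(ch for ch in body if is_cjk(ch))
    return word_counts, char_counts


# Tiny made-up example; a real run would load the vocabulary from CEDICT.
vocab = {"中国", "经济", "发展", "新闻"}
articles = ["中国经济发展很快。", "经济新闻:中国发展。"]
word_counts, char_counts = count_frequencies(articles, vocab)
```

Calling `word_counts.most_common(1000)` would then give a top-1000 list. Removing repeated site headers before counting, as mentioned in the post, would be a separate cleaning step that this sketch leaves out.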
