Help Test Text Segmenter

nick   December 5th, 2012 3:40p.m.

A lot of the work of supporting the new example sentence system is being able to take messy input and turn it into actual Chinese/Japanese words, handling punctuation and matching/generating readings along with it. The Chinese side of this is getting close to ready, but I need your help to test it:


Please try to break it, and when you do, email the input, expected output, and actual output or error to nick@skritter.com. Looking for things like punctuation problems, simp/trad conversion issues, interface suggestions in the displayed sentence at the bottom, etc.

If a word is automatically segmented incorrectly, that's just going to happen sometimes--hopefully not too often. You can manually adjust it by adding/removing spaces afterward and rerunning the segmentation.
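
To illustrate why re-running with manual spaces helps, here is a minimal sketch of greedy longest-match segmentation that treats spaces in the input as hard word boundaries. It is not Skritter's actual code: the `segment` function, the `LEXICON` set, and the `max_len` limit are all hypothetical names made up for this example, and the real segmenter may work quite differently.

```python
def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation (illustrative only).

    Spaces in the input are treated as hard word boundaries, so a user
    can correct a bad split by inserting a space and re-running.
    """
    words = []
    for chunk in text.split():
        i = 0
        while i < len(chunk):
            # Try the longest candidate first; fall back to one character.
            for size in range(min(max_len, len(chunk) - i), 0, -1):
                candidate = chunk[i:i + size]
                if size == 1 or candidate in lexicon:
                    words.append(candidate)
                    i += size
                    break
    return words


# Toy lexicon, assumed for the example.
LEXICON = {"研究", "研究生", "生命", "起源"}

print(segment("研究生命起源", LEXICON))   # → ['研究生', '命', '起源'] (greedy mistake)
print(segment("研究 生命起源", LEXICON))  # → ['研究', '生命', '起源'] (space fixes it)
```

The greedy pass grabs 研究生 first and strands 命; adding one space before 生命 is enough to force the intended split.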

If the text is too long, then it won't work. If there's any interest in using something like this for longer texts, a la Byzanti's Chinese Reader, we could improve this, but I'm not sure how useful this Reader tool could be on its own.

(Japanese doesn't work yet, but most of the code is there.)

Laspimon   December 5th, 2012 7:44p.m.

I have not gotten it to segment anything. There is no output, whatever I write, only the message: "Couldn't parse that input into words. Sorry."

nick   December 5th, 2012 7:56p.m.

Try again now--I temporarily introduced a bug.

Schnabelhund   December 5th, 2012 9:58p.m.

Seems to work fine so far! What kind of things are difficult to segment?

nick   December 6th, 2012 12:02a.m.

So far: things with crazy punctuation, mixes of simp and trad, weird trad variant characters, things with English and numbers in there... it's a bit hard for me to know what still doesn't work, because if I've thought of it, then I've handled it. It's things which I haven't thought of that won't be quite right.
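
One way to make the mixed cases above easier to reason about is to split the input into runs of Chinese characters, runs of English letters and digits, and individual punctuation marks before doing any dictionary lookup. The sketch below is only an assumption about how such a pre-pass could look, not a description of the tool itself; `TOKEN` and `split_runs` are hypothetical names, and the character ranges cover only the basic CJK Unified Ideographs block and Extension A.

```python
import re

# Alternatives are tried left to right at each position, so "other"
# only matches a character that is neither Han nor ASCII alphanumeric.
TOKEN = re.compile(
    r"(?P<han>[\u3400-\u4dbf\u4e00-\u9fff]+)"   # runs of Chinese characters
    r"|(?P<ascii>[A-Za-z0-9]+)"                 # runs of English letters/digits
    r"|(?P<other>\S)"                           # any other single non-space char
)


def split_runs(text):
    """Return (kind, text) pairs; whitespace is dropped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]


for kind, run in split_runs("去年3·11日本大地震"):
    print(kind, run)
```

With the runs separated, only the `han` pieces need to go through word segmentation, and a sentence that starts with a number or ends with an English word becomes an ordinary case rather than a special one.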

radiator   December 6th, 2012 12:51a.m.

I added a paragraph with 他进

The tool shows no translation on mouseover.

I will email further comments, but does pasting the paragraph into the parser automatically add the words to my words list?

pts   December 6th, 2012 11:45a.m.

Can't parse sentences that start with a number, e.g.

nick   December 6th, 2012 12:46p.m.

It doesn't automatically add to My Words, no. I don't think I got your email about which paragraph it was that wasn't showing anything on mouseover--can you let me know what the input was? Skritter doesn't translate the whole paragraph into English; it just glosses each word in the text you give it.

Thanks for the test case, pts. I've fixed that one now.

俞翰森   December 6th, 2012 3:45p.m.

Segmentation seems to work very well, but some of the Chinese punctuation gets "simplified", e.g. 《...》 becomes 〈....〉

Expressions such as 哇~~ are not segmented at all.

This will fail: 而在综合类大奖最后一项“最佳新人奖”中,将 Fun。 If I remove the English at the end, it works. Other English does work, however.

I really like this. Very promising.

nick   December 6th, 2012 5:38p.m.

I intended to simplify all the Chinese punctuation when converting from writing to pinyin, yes. I figured that having 《...》 in a pinyin string is not as useful as having . Thoughts?

I think 哇~~ isn't working because Skritter is treating it as a single character (哇) with some extra punctuation (~~), and I set it to not do single characters. For this tool, I can enable it; for example sentences, they'll have to be more than one character (and not be the same as any word in the database, either).
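
As a rough sketch of that rule (not the actual implementation), the check could ignore punctuation and symbols, count what is left, and reject the input when only one character remains unless single characters are explicitly allowed. The names `content_chars`, `accept`, and `allow_single` are hypothetical and chosen just for this example; the database-duplicate check is left out.

```python
import unicodedata


def content_chars(text):
    """Keep letters/ideographs and digits; drop punctuation, symbols, spaces."""
    return [ch for ch in text
            if unicodedata.category(ch)[0] in ("L", "N")]


def accept(text, allow_single=False):
    """Accept input with more than one content character.

    With allow_single=True (the segmenter tool), a lone character such
    as 哇 followed by punctuation is accepted too.
    """
    n = len(content_chars(text))
    return n > 1 or (allow_single and n == 1)


print(accept("哇~~"))                      # → False (example-sentence mode)
print(accept("哇~~", allow_single=True))   # → True  (segmenter tool mode)
```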

Will figure out how to handle fun!

pts   December 7th, 2012 1:11p.m.

Input: 干 – 經脫水加工製成的乾燥食品。通「乾」。如:「筍干」、「豆腐干」。
Expected traditional output: the same as the input.
Actual traditional output: 乾-經脫水加工製成的乾燥食品。通「乾」。如:「筍干」、「豆腐乾」。

nick   December 7th, 2012 2:14p.m.

Hmm--I'm getting 干 as the first character of the actual output when I do it. Not sure why we're seeing different results. Can you confirm that it's turning 干 by itself into 乾? Does it do it in other contexts?

For 豆腐乾, it's because the word in Skritter had 乾 chosen instead of 干--I've fixed it now, so after that propagates through the cache, it should choose properly for that word.

pts   December 7th, 2012 3:28p.m.

Yes, now 干 is the first character.

In the future, will it change all 豆腐乾 into 豆腐干? But please keep in mind that it should not be changed if the input is 豆腐乾.

nick   December 7th, 2012 5:19p.m.

Yes, since I changed the traditional form to 豆腐干 in our database.

If the word is in the system, Skritter will alter the traditional output to match what Skritter thinks the traditional version of that word is. So it will change 豆腐乾 into 豆腐干, yes. Skritter doesn't support multiple traditional variants for the same simplified multi-character word, and that extends into this mode, with one exception: currently if there are too many characters to look them all up, it doesn't do such a replacement. But if we changed it to be useful for reading long texts, it would chunk the input and then look them up and replace them.
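
A minimal sketch of that replacement step, under stated assumptions: each simplified word has at most one stored traditional form, lookups are skipped when the input is too long, and a chunked variant lifts that limit for long texts. The tables and function names below (`TO_SIMPLIFIED`, `PREFERRED_TRAD`, `canonical_traditional`, and so on) are toy data and hypothetical names, not Skritter's database or API.

```python
# Toy data, purely illustrative.
TO_SIMPLIFIED = {"乾": "干"}            # traditional char -> simplified char
PREFERRED_TRAD = {"豆腐干": "豆腐干"}    # simplified word -> the one stored traditional form


def simplify(word):
    """Character-by-character simplification using the toy table."""
    return "".join(TO_SIMPLIFIED.get(ch, ch) for ch in word)


def canonical_traditional(words, max_words=100):
    """Swap each known word for its stored traditional form.

    Inputs longer than max_words are returned unchanged, mirroring the
    'too many characters to look them all up' exception.
    """
    if len(words) > max_words:
        return list(words)
    return [PREFERRED_TRAD.get(simplify(w), w) for w in words]


def canonical_traditional_chunked(words, chunk_size=100):
    """Process long inputs chunk by chunk so the limit never applies."""
    out = []
    for start in range(0, len(words), chunk_size):
        chunk = words[start:start + chunk_size]
        out.extend(canonical_traditional(chunk, max_words=chunk_size))
    return out


print(canonical_traditional(["豆腐乾", "乾燥"]))  # → ['豆腐干', '乾燥']
```

Words that are not in the table, like 乾燥 here, pass through untouched; only words the database knows about are rewritten.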

pts   December 8th, 2012 12:26p.m.

Input: 去年3·11日本大地震後經歷了巨大海嘯的宮城、岩手、福島三縣和東北青森縣海岸目前又陷入海嘯警報範圍。
Expected output: same as input
Actual output: 去年3·11日本大地震後經歷瞭巨大海嘯的宮城、岩手、福島三縣和東北青森縣海岸目前又陷入海嘯警報範圍。

The character 了 is incorrectly converted into 瞭.

pts   December 8th, 2012 12:56p.m.

Input: 雖然這次預測浪高是1米以下,不過三縣照例發出了避難勸告,其中宮城縣石卷市鮎川海岸1米高的海嘯已在5點40分抵達。
Expected output: same as input
Actual output: 雖然這次預測浪高是1米以下,不過三縣照例發出瞭避難勸告,其中宮城縣石捲市鮎川海岸1米高的海嘯已在5點40分抵達。

石卷市 is incorrectly converted into 石捲市.
Again 了 is incorrectly converted into 瞭.

pts   December 8th, 2012 1:16p.m.

Can't parse this input:

eurowatz   December 9th, 2012 7:48a.m.

hey nick, do you have a rough plan for when this feature might go online? it's gonna be a great improvement.

nick   December 9th, 2012 2:24p.m.

Thanks, pts--great test cases.

eurowatz, there are many pieces left to do, but we're making good progress. We'll probably start testing it on beta this week or next.

eurowatz   December 10th, 2012 3:51a.m.

great! eager to test it!!

pts   December 10th, 2012 3:50p.m.

Input: 有议员建议,在现有行车道路,开辟单车专用通道,或规定单车优先使用慢线.
Expected output: 有議員建議,在現有行車道路,開闢單車專用通道,或規定單車優先使用慢線.
Actual output: 有議員建議,在現有行車道路,開辟單車專用通道,或規定單車優先使用慢線.

The traditional form of 开辟 is 開闢.

Input: 六四民運領袖王丹十日則在臉書上貼文大罵:「他的無恥可以說是表露無遺了。那些前兩天還在為莫言辯護的網友,應當會有點傻眼吧?」
Expected output: same as input
Actual output: 六四民運領袖王丹十日則在臉書上貼文大罵:「他的無恥可以說是錶露無遺瞭。那些前兩天還在為莫言辯護的網友,應當會有點傻眼吧?」

表露 incorrectly converted into 錶露.
了 is incorrectly converted into 瞭.
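
The cases in this post all involve simplified characters with more than one traditional counterpart (了/瞭, 表/錶, 辟/闢). One possible approach, sketched here as an assumption rather than a description of how the tool works, is to convert known words through a word-level table first and to leave ambiguous characters alone when no word entry covers them. `WORD_TABLE`, `CHAR_TABLE`, and `to_traditional` are hypothetical names with toy data.

```python
# Word-level entries win over character-level ones.
WORD_TABLE = {"开辟": "開闢"}

# Only characters with a single traditional form belong here.
# Ambiguous ones (了/瞭, 表/錶, 辟/闢, 卷/捲, 干/乾) are deliberately
# left out, so they stay unchanged unless a word entry says otherwise.
CHAR_TABLE = {"开": "開", "议": "議", "员": "員"}


def to_traditional(words):
    """Convert pre-segmented words, preferring whole-word mappings."""
    out = []
    for w in words:
        if w in WORD_TABLE:
            out.append(WORD_TABLE[w])
        else:
            out.append("".join(CHAR_TABLE.get(ch, ch) for ch in w))
    return out


print(to_traditional(["有", "议员", "开辟", "表露", "了"]))
# → ['有', '議員', '開闢', '表露', '了']
```

The trade-off is that an ambiguous character in an unknown word is never converted, which under-converts some simplified input but avoids turning already-correct traditional text like 表露 into 錶露.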

pts   December 10th, 2012 3:51p.m.

Input: 香港支聯會十日從中環出發,帶著要寄給諾貝爾和平獎得主劉曉波的耶誕包裹遊行到郵政總局,要求中國政府立即釋放劉曉波、停止軟禁劉妻劉霞。
Expected output: 香港支联会十日从中环出发,带着要寄给诺贝尔和平奖得主刘晓波的耶诞包裹游行到邮政总局,要求中国政府立即释放刘晓波、停止软禁刘妻刘霞。
Actual output: 香港支联会十日从中环出发,带著要寄给诺贝尔和平奖得主刘晓波的耶诞包裹游行到邮政总局,要求中国政府立即释放刘晓波、停止软禁刘妻刘霞。

The simplified form of 帶著 is 带着.

nomadwolf   December 17th, 2012 6:02a.m.

A week late, but it won't let me segment anything. Just "Something went wrong..."

nick   December 17th, 2012 1:09p.m.

nomadwolf, what kind of input are you using? It's still working for me on a few test cases I'm trying.

nomadwolf   December 17th, 2012 10:41p.m.

Strangely enough, it works now. :P
Yesterday I had just tried the default text that shows up when you open the page & also a simple sentence with only "!" at the end.

Both are working OK today.

