
Typical lexicon size in natlangs
Gary Shannon | May 11, 2013
Alex Fink wrote:
> For instance, it's a number bandied around that knowing 500 hanzi will allow you
> to read 90% of the characters in a Chinese newspaper -- but usually by people
> who don't appreciate the fact that this includes all the grammatical and closed-class
> words, and a swathe of basic lexis, but probably not the interesting word or two
> in the headline you care about.

For example, if you know the most common 28 words in English you can read 50% of everything written. But what does THAT mean if 50% means that you can read only 50% of each sentence?

[...]

groups.yahoo.com/neo/groups/co

jansegers @jansegers

Or, if you get really ambitious, you can learn 732 words and read 90% of everything written in English. If you want to be able to read 99.9% of everything written in English, you will need to learn 2090 words. (These figures are from my own million-word corpus taken from 20th-century fiction and non-fiction on Gutenberg.com.)

So what does it really mean to say you can read 90% by knowing 732 words?

Maybe the only meaningful measure of lexicon size is how many words you must know to cover some specified x% of the whole of the written corpus. That's a very different number for Toki Pona than it is for English. That way you could talk meaningfully about a specific language's "90% coverage lexicon", and its "98% coverage lexicon", and so on.

--gary
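
The "x% coverage lexicon" idea above can be sketched in a few lines of Python. This is only an illustration, not the method used for the figures quoted in the post: the function name `coverage_lexicon_size`, the regex tokenizer, and the choice to count every inflected form as a separate word are all assumptions made here.

```python
import re
from collections import Counter


def coverage_lexicon_size(text: str, target: float) -> int:
    """Return how many distinct words, taken from most to least frequent,
    are needed so that their occurrences make up at least `target`
    (a fraction in (0, 1]) of all word tokens in `text`."""
    # Assumption: a "word" is a lowercase run of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0
    counts = Counter(tokens)
    needed = target * len(tokens)  # number of tokens that must be covered
    covered = 0
    for size, (_word, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= needed:
            return size
    return len(counts)


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for pct in (0.5, 1.0):
        print(pct, coverage_lexicon_size(sample, pct))
```

Run on a real corpus, the result depends heavily on the tokenizer and on whether inflected forms are merged, so numbers from a sketch like this are only comparable across languages when those choices are held fixed.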

