jansegers @jansegers

Typical lexicon size in natlangs
Gary Shannon | May 11, 2013
Alex Fink wrote:
> For instance, it's a number bandied around that knowing 500 hanzi will allow you
> to read 90% of the characters in a Chinese newspaper -- but usually by people
> who don't appreciate the fact that this includes all the grammatical and closed-class
> words, and a swathe of basic lexis, but probably not the interesting word or two
> in the headline you care about.

For example, if you know the most common 28 words in English you can read 50% of everything written. But what does THAT mean if 50% means that you can read only 50% of each sentence?

[...]

groups.yahoo.com/neo/groups/co


Or, if you get really ambitious you can learn 732 words and read 90% of everything written in English. If you want to be able to read 99.9% of everything written in English you will need to learn 2090 words. (These figures are from my own million-word corpus taken from 20th century fiction and non-fiction on Gutenberg.com.)

So what does it really mean to say you can read 90% by knowing 732 words?

Maybe the only meaningful measure of lexicon size is how many words you must know to cover some specified x% of the whole of the written corpus. That's a very different number for Toki Pona than it is for English. That way you could talk meaningfully about a specific language's "90% coverage lexicon", and its "98% coverage lexicon", and so on.
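As a rough illustration of the "x% coverage lexicon" idea, here is a minimal Python sketch. It is not the method used for the figures quoted above: the function name `coverage_lexicon_size`, the tokenization rule (lowercased runs of letters and apostrophes), and the sample sentence are all assumptions made for the example. It counts how many of the most frequent word types are needed before their tokens add up to a given share of the running text.

```python
import re
from collections import Counter


def coverage_lexicon_size(text, coverage):
    """Return how many of the most frequent word types are needed so that
    their tokens make up at least `coverage` (a fraction, 0..1) of the text.

    Assumption: a "word" is a lowercased run of letters and apostrophes.
    A real study would need a more careful definition (inflections,
    hyphenation, proper names, and so on).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0

    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    # Walk the word types from most to least frequent, accumulating tokens.
    for size, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= coverage:
            return size
    return len(counts)


if __name__ == "__main__":
    # Hypothetical toy corpus, only to show the call.
    sample = "the cat and the dog and the bird saw the fish"
    for share in (0.3, 0.5, 1.0):
        print(share, coverage_lexicon_size(sample, share))
```

Run over a large corpus with shares such as 0.5, 0.9 and 0.999, a function like this would yield the kind of figures discussed here, though the exact numbers depend heavily on the corpus and on how a "word" is defined.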

--gary
