initial-commit
37
vocab/README.md
Normal file
@@ -0,0 +1,37 @@
# Dutch Word List
Last updated: 2023-03-10

This repository contains the official OpenTaal Dutch word list, comprising over 400,000 words compiled from contributions and curated sources. The list is UTF-8 encoded and sorted alphabetically.
## Contents
### Primary File
- **`wordlist.txt`** – Complete UTF-8 word list (one word per line).
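Since the primary file is plain UTF-8 text with one word per line, it can be loaded with a few lines of Python. The following is a minimal sketch, not code shipped with this repository: the function name `load_words` is hypothetical, and skipping blank lines is an assumption about how a consumer might want to treat the file.

```python
from pathlib import Path


def load_words(path):
    """Read a one-word-per-line UTF-8 file into a list, skipping blank lines."""
    # Decode explicitly as UTF-8 rather than relying on the platform default.
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace only; spaces inside an entry are preserved.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A call such as `load_words("wordlist.txt")` would return the entries in file order, assuming the script runs from the directory that holds the file.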
### Metadata
- **`datetimeversion.txt`** – Timestamp and version information.
### Component Files
- **`elements/basiswoorden-gekeurd.txt`** – Approved base words (~200k entries).
- **`elements/basiswoorden-ongekeurd.txt`** – Unapproved base words, including proper nouns and compounds (~41k entries).
- **`elements/flexies-ongekeurd.txt`** – Unapproved inflections (~170k entries).
- **`elements/wordparts.tsv`** – Word parts containing spaces (TSV format).
- **`elements/corrections.tsv`** – Common misspellings with corrections (TSV format).
- **`elements/romeinse-cijfers.txt`** – Roman numerals (~4k entries).
- **`elements/wordlist-ascii.txt`** – ASCII-only subset (excludes accented characters).
- **`elements/wordlist-non-ascii.txt`** – Entries containing non-ASCII characters.
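The ASCII and non-ASCII files listed above suggest a per-entry split on whether every character is ASCII. The sketch below illustrates one way such a split could be computed. It assumes that rule, which this README does not spell out, and `split_by_ascii` is a hypothetical name rather than a tool from this repository.

```python
def split_by_ascii(words):
    """Partition words into (ascii_only, non_ascii) lists, preserving order."""
    ascii_only, non_ascii = [], []
    for word in words:
        # str.isascii() (Python 3.7+) is True when every character
        # lies in the range U+0000..U+007F.
        if word.isascii():
            ascii_only.append(word)
        else:
            non_ascii.append(word)
    return ascii_only, non_ascii
```

Under this rule an entry such as `café` or `m²` would land in the non-ASCII list, while `aap` would land in the ASCII list.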
## Character Set
The word list includes standard Latin letters (a–z, A–Z), Dutch diacritics (e.g., `é`, `ë`, `ï`), superscript and subscript digits (e.g., `²`, `³`), and the punctuation characters `' . - / + & @ ?`.
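A consumer who wants to check entries against this description could classify characters by their Unicode names. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the diacritics in the description are examples rather than a complete list (so any precomposed Latin letter is accepted here), and anything not named in the description, such as ordinary digits or spaces, is reported as unexpected.

```python
import unicodedata

# Punctuation characters named in the character-set description.
PUNCTUATION = set("'.-/+&@?")


def is_allowed_char(ch):
    """Return True if ch falls into one of the documented character classes."""
    if ch in PUNCTUATION:
        return True
    name = unicodedata.name(ch, "")
    # Latin letters, with or without diacritics,
    # e.g. LATIN SMALL LETTER E WITH ACUTE.
    if ch.isalpha() and name.startswith("LATIN"):
        return True
    # Superscript and subscript digits, e.g. SUPERSCRIPT TWO.
    return ch.isdigit() and name.startswith(("SUPERSCRIPT", "SUBSCRIPT"))


def unexpected_chars(word):
    """Return the sorted characters of word that are outside the documented classes."""
    return sorted({ch for ch in word if not is_allowed_char(ch)})
```

Letters written with combining marks (decomposed form) would be flagged by this check, so normalizing input with `unicodedata.normalize("NFC", word)` first may be advisable.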