# Dutch Word List

Last updated: 2023-03-10

This repository contains the official OpenTaal Dutch word list, comprising over 400,000 words compiled from contributions and curated sources. The list is provided in UTF-8 encoding and is alphabetically sorted.

## Contents

### Primary File

- **`wordlist.txt`** – Complete UTF-8 word list (one word per line).
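
Because the primary file is plain UTF-8 text with one word per line, it can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the function name `load_words` and the default path are assumptions made for illustration.

```python
def load_words(path="wordlist.txt"):
    """Read a UTF-8 word list with one entry per line.

    `path` is a hypothetical location; point it at your local copy.
    Blank lines are skipped and line endings are removed.
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\r\n") for line in handle if line.strip()]
```

For repeated membership checks, wrapping the result in `frozenset(...)` is usually a better fit than searching the list.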

### Metadata

- **`datetimeversion.txt`** – Timestamp and version information.

### Component Files

- **`elements/basiswoorden-gekeurd.txt`** – Approved base words (~200k entries).
- **`elements/basiswoorden-ongekeurd.txt`** – Unapproved base words, including proper nouns and compounds (~41k entries).
- **`elements/flexies-ongekeurd.txt`** – Unapproved inflections (~170k entries).
- **`elements/wordparts.tsv`** – Word parts containing spaces (TSV format).
- **`elements/corrections.tsv`** – Common misspellings with corrections (TSV format).
- **`elements/romeinse-cijfers.txt`** – Roman numerals (~4k entries).
- **`elements/wordlist-ascii.txt`** – ASCII-only subset (excludes accented characters).
- **`elements/wordlist-non-ascii.txt`** – Entries containing non-ASCII characters.
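
The ASCII and non-ASCII files listed above describe a simple partition of the entries. The sketch below shows one way such a split could be reproduced in Python; it is an illustration under that assumption, not the project's actual build step, and `split_by_ascii` is a hypothetical helper name.

```python
def split_by_ascii(words):
    """Partition words into (ascii_only, non_ascii), preserving input order.

    Assumption: an entry belongs to the ASCII subset exactly when every
    character is ASCII, which is what str.isascii() (Python 3.7+) tests.
    """
    ascii_only, non_ascii = [], []
    for word in words:
        if word.isascii():
            ascii_only.append(word)
        else:
            non_ascii.append(word)
    return ascii_only, non_ascii
```

Every input word lands in exactly one of the two lists, so their combined length equals the length of the input.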

## Character Set

The word list includes standard Latin letters (a–z, A–Z), Dutch diacritics (e.g., `é`, `ë`, `ï`), superscript/subscript digits (e.g., `²`, `³`), and the punctuation characters `'`, `.`, `-`, `/`, `+`, `&`, `@`, and `?`.
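
To see which non-letter characters actually occur in a given copy of the list, a quick inventory can help. The following is a rough sketch using Python's standard `unicodedata` module; the helper name `non_letter_characters` is hypothetical, and the input is assumed to be an iterable of words already read from the file.

```python
import unicodedata
from collections import Counter


def non_letter_characters(words):
    """Count every character that is not a Unicode letter (category L*).

    Words are normalized to NFC first, so that an accented letter stored
    as base letter plus combining mark is still counted as a letter
    where a precomposed form exists.
    """
    counts = Counter()
    for word in words:
        for ch in unicodedata.normalize("NFC", word):
            if not unicodedata.category(ch).startswith("L"):
                counts[ch] += 1
    return counts
```

Characters in the result that fall outside the set described above would be worth a closer look.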