inital-commit

mike
2025-12-19 14:02:07 +01:00
commit 6f0a04f0a1
91 changed files with 1398111 additions and 0 deletions

vocab/README.md Normal file

@@ -0,0 +1,37 @@
![GitHub last commit](https://img.shields.io/github/last-commit/opentaal/opentaal-wordlist)
![GitHub commit activity](https://img.shields.io/github/commit-activity/y/opentaal/opentaal-wordlist)
![GitHub Repo stars](https://img.shields.io/github/stars/opentaal/opentaal-wordlist)
![GitHub watchers](https://img.shields.io/github/watchers/opentaal/opentaal-wordlist)
![GitHub Sponsors](https://img.shields.io/github/sponsors/opentaal)
![Liberapay patrons](https://img.shields.io/liberapay/patrons/opentaal)
# Dutch Word List
Last updated: 2023-03-10

This repository contains the official OpenTaal Dutch word list, comprising over 400,000 words compiled from contributions and curated sources. The list is provided in UTF-8 encoding and is alphabetically sorted.
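
As a minimal sketch of how the list might be consumed, the Python snippet below builds a lookup set from lines of text, one entry per line, as described for `wordlist.txt`. The function names `load_words` and `is_listed` and the sample entries are hypothetical and not part of this repository.

```python
from typing import Iterable


def load_words(lines: Iterable[str]) -> set[str]:
    """Build a lookup set from an iterable of lines, one entry per line."""
    # Strip only the line terminator so entries are kept exactly as written;
    # blank lines are skipped.
    return {line.rstrip("\r\n") for line in lines if line.strip()}


def is_listed(word: str, words: set[str]) -> bool:
    """Exact, case-sensitive membership test."""
    return word in words


# In-memory sample standing in for the real file (illustrative entries only).
sample = ["aap\n", "café\n", "zee\n"]
words = load_words(sample)
print(is_listed("café", words))  # True
print(is_listed("Café", words))  # False
```

To use the real file, pass an open file object instead of the sample, for example `with open("wordlist.txt", encoding="utf-8") as f: words = load_words(f)`. A set is used rather than a binary search because the exact collation behind "alphabetically sorted" is not specified here.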
## Contents
### Primary File
- **`wordlist.txt`**: Complete UTF-8 word list (one word per line).
### Metadata
- **`datetimeversion.txt`**: Timestamp and version information.
### Component Files
- **`elements/basiswoorden-gekeurd.txt`**: Approved base words (~200k entries).
- **`elements/basiswoorden-ongekeurd.txt`**: Unapproved base words, including proper nouns and compounds (~41k entries).
- **`elements/flexies-ongekeurd.txt`**: Unapproved inflections (~170k entries).
- **`elements/wordparts.tsv`**: Word parts containing spaces (TSV format).
- **`elements/corrections.tsv`**: Common misspellings with corrections (TSV format).
- **`elements/romeinse-cijfers.txt`**: Roman numerals (~4k entries).
- **`elements/wordlist-ascii.txt`**: ASCII-only subset (excludes accented characters).
- **`elements/wordlist-non-ascii.txt`**: Entries containing non-ASCII characters.
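
The ASCII and non-ASCII component files suggest a simple partition of the full list. The sketch below shows one way such a split could be reproduced in Python with `str.isascii()`. Whether the project generates its files this way is an assumption, and the function name `split_by_ascii` and the sample entries are hypothetical.

```python
from typing import Iterable


def split_by_ascii(words: Iterable[str]) -> tuple[list[str], list[str]]:
    """Partition words into (ASCII-only, containing non-ASCII), keeping order."""
    ascii_words: list[str] = []
    non_ascii_words: list[str] = []
    for word in words:
        # str.isascii() is True only if every character is below U+0080,
        # so accented letters and superscript digits both count as non-ASCII.
        if word.isascii():
            ascii_words.append(word)
        else:
            non_ascii_words.append(word)
    return ascii_words, non_ascii_words


# Illustrative entries only, not taken from the actual files.
sample = ["aap", "café", "reëel", "zee", "m²"]
ascii_words, non_ascii_words = split_by_ascii(sample)
print(ascii_words)      # ['aap', 'zee']
print(non_ascii_words)  # ['café', 'reëel', 'm²']
```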
## Character Set
The word list uses standard Latin letters (a-z, A-Z), Dutch diacritics (e.g., `é`, `ë`, `ï`), superscript/subscript digits (e.g., `²`, `³`), and the following punctuation: `' . - / + & @ ?`.
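
The following Python sketch shows one way to flag characters that fall outside the set described above. It is an approximation under stated assumptions: the diacritics are only given as examples, so any Unicode letter is accepted via `str.isalpha()`; superscript and subscript digits are matched through the Unicode category `No`, which is broader than those digits alone; and entries are assumed to use precomposed (NFC) characters. The function name `unexpected_chars` is hypothetical.

```python
import unicodedata

# Punctuation listed in the "Character Set" section.
PUNCTUATION = set("'.-/+&@?")


def unexpected_chars(word: str) -> list[str]:
    """Return the characters in `word` that fall outside the described set."""
    unexpected = []
    for ch in word:
        if ch.isalpha():
            # Any letter, with or without diacritics (assumption: this is
            # looser than "Latin letters plus Dutch diacritics").
            continue
        if unicodedata.category(ch) == "No":
            # "Other number" category, which includes superscript and
            # subscript digits such as ² and ₂.
            continue
        if ch in PUNCTUATION:
            continue
        unexpected.append(ch)
    return unexpected


print(unexpected_chars("café"))   # []
print(unexpected_chars("m²"))     # []
print(unexpected_chars("a_b 1"))  # ['_', ' ', '1']
```

Spaces and plain ASCII digits are flagged because the description above does not list them. Adjust the checks if your copy of the list is known to contain them.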