The corpora available from the SADiLaR Corpus Portal are, unless stated otherwise, available for research purposes only. Individual corpora may have less restrictive licensing and may be downloaded from the links provided in their descriptions.
All corpora used for any purpose must be referenced according to the referencing information provided in the description.
Annotation
Corpora for the indigenous South African languages are automatically annotated for lemmas and part of speech using the available NCHLT Text lemmatisers and part-of-speech taggers. Information on the accuracy and tag sets for these languages is available here: NCHLT Web Service. No quality control of the automatic annotations was performed.
The English data is annotated using the open-source NLP4J library available here.
Autshumato Afrikaans Corpus 1.0
The Afrikaans side of the Autshumato English-Afrikaans parallel corpus, primarily from South African government websites.
Reference
Department of Arts and Culture & CTexT. 2013. Autshumato English-Afrikaans Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 744-411-936-476-1. Available: https://repo.sadilar.org/handle/20.500.12185/397
Size
2,626,138 tokens
Autshumato isiZulu Corpus 1.0
The isiZulu side of the Autshumato English-isiZulu parallel corpus, primarily from South African government websites.
Reference
Department of Arts and Culture & CTexT. 2013. Autshumato English-isiZulu Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 101-618-922-810-4. Available: https://hdl.handle.net/20.500.12185/399
Size
433,887 tokens
Autshumato Sesotho sa Leboa Corpus 1.0
The Sesotho sa Leboa side of the Autshumato English-Sesotho sa Leboa parallel corpus, primarily from South African government websites.
Reference
Department of Arts and Culture & CTexT. 2013. Autshumato English-Sesotho sa Leboa Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 954-612-592-883-6. Available: https://hdl.handle.net/20.500.12185/402
Size
880,512 tokens
Autshumato Setswana Corpus 1.0
The Setswana side of the Autshumato English-Setswana parallel corpus, primarily from South African government websites.
Reference
Department of Arts and Culture & CTexT. 2016. Autshumato English-Setswana Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 379-219-829-093-2. Available: https://hdl.handle.net/20.500.12185/404
Size
2,873,637 tokens
Autshumato Xitsonga Corpus 1.0
The Xitsonga side of the Autshumato English-Xitsonga parallel corpus, primarily from South African government websites.
Reference
Department of Arts and Culture & CTexT. 2014. Autshumato English-Xitsonga Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 463-910-862-996-9. Available: https://repo.sadilar.org/handle/20.500.12185/405
Size
4,718,935 tokens
NCHLT Afrikaans Text Corpus 1.0
A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.
Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR Sesotho sa Leboa Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus
Size
197,352 tokens
Oxford University Press-SADiLaR Setswana Corpus 1.0
A collection of Setswana literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes.