
Usage restrictions

The corpora on the SADiLaR Corpus Portal are available primarily for research purposes. Individual corpora may have less restrictive licensing and may be downloaded from the links provided in their descriptions.

All corpora used for any purpose must be referenced according to the referencing information provided in the description.

Annotation

Corpora for the indigenous South African languages are automatically annotated for lemmas and parts of speech using the available NCHLT Text lemmatisers and part-of-speech taggers. Information on the accuracy and tag sets for these languages is available here: NCHLT Web Service. No quality control of the automatic annotations was performed.

The English data is annotated using the open-source NLP4J library available here.

Autshumato Afrikaans Corpus 1.0

The Afrikaans side of the Autshumato English-Afrikaans parallel corpus, primarily from South African government websites.

Reference

Department of Arts and Culture & CTexT. 2013. Autshumato English-Afrikaans Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 744-411-936-476-1. Available: https://repo.sadilar.org/handle/20.500.12185/397

Size

2,626,138 tokens

Autshumato isiZulu Corpus 1.0

The isiZulu side of the Autshumato English-isiZulu parallel corpus, primarily from South African government websites.

Reference

Department of Arts and Culture & CTexT. 2013. Autshumato English-isiZulu Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 101-618-922-810-4. Available: https://hdl.handle.net/20.500.12185/399

Size

433,887 tokens

Autshumato Sesotho sa Leboa Corpus 1.0

The Sesotho sa Leboa side of the Autshumato English-Sesotho sa Leboa parallel corpus, primarily from South African government websites.

Reference

Department of Arts and Culture & CTexT. 2013. Autshumato English-Sesotho sa Leboa Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 954-612-592-883-6. Available: https://hdl.handle.net/20.500.12185/402

Size

880,512 tokens

Autshumato Setswana Corpus 1.0

The Setswana side of the Autshumato English-Setswana parallel corpus, primarily from South African government websites.

Reference

Department of Arts and Culture & CTexT. 2016. Autshumato English-Setswana Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 379-219-829-093-2. Available: https://hdl.handle.net/20.500.12185/404

Size

2,873,637 tokens

Autshumato Xitsonga Corpus 1.0

The Xitsonga side of the Autshumato English-Xitsonga parallel corpus, primarily from South African government websites.

Reference

Department of Arts and Culture & CTexT. 2014. Autshumato English-Xitsonga Parallel Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 463-910-862-996-9. Available: https://repo.sadilar.org/handle/20.500.12185/405

Size

4,718,935 tokens

NCHLT Afrikaans Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. Afrikaans NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 544-932-849-161-3. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/afrikaans-nchlt-text-corpora.html

Size

2,574,308 words

NCHLT English Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. English NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 481-998-928-542-6. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/english-nchlt-text-corpora.html

Size

12,762,146 words

NCHLT isiNdebele Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. isiNdebele NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 858-844-618-880-3. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/isindebele-nchlt-text-corpora.html

Size

1,061,626 words

NCHLT isiXhosa Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. isiXhosa NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 848-955-511-452-0. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/isixhosa-nchlt-text-corpora.html

Size

1,500,602 words

NCHLT isiZulu Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. isiZulu NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 356-122-378-131-9. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/isizulu-nchlt-text-corpora.html

Size

1,965,473 words

NCHLT Sepedi Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. Sepedi NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 484-745-801-426-4. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/sepedi-nchlt-text-corpora.html

Size

2,442,432 words

NCHLT Sesotho Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. Sesotho NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 209-669-000-528-9. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/sesotho-nchlt-text-corpora.html

Size

2,001,558 words

NCHLT Setswana Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. Setswana NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 664-418-763-519-0. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/setswana-nchlt-text-corpora.html

Size

1,394,260 words

NCHLT SiSwati Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. SiSwati NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 093-210-851-959-9. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/siswati-nchlt-text-corpora.html

Size

1,112,804 words

NCHLT Tshivenḓa Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. Tshivenḓa NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 450-604-191-615-2. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/tshivenda-nchlt-text-corpora.html

Size

1,084,354 words

NCHLT Xitsonga Text Corpus 1.0

A collection of documents from the South African government domain crawled from gov.za websites and collected from various language units.

Reference

Department of Arts and Culture & CTexT. 2013. Xitsonga NCHLT Text Corpora. Potchefstroom: CTexT, North-West University. ISLRN: 324-577-661-200-9. Available: http://rma.nwu.ac.za/index.php/resource-catalogue/xitsonga-nchlt-text-corpora.html

Size

1,445,830 words

Oxford University Press-SADiLaR Afrikaans Corpus 1.0

A collection of Afrikaans literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes only.

© Oxford University Press Southern Africa (Pty) Ltd. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press Southern Africa (Pty) Ltd, or as expressly permitted by law, or under terms agreed with the appropriate designated reprographics rights organisation.

Reference

Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR Afrikaans Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus

Size

443,815 tokens

Oxford University Press-SADiLaR isiXhosa Corpus 1.0

A collection of isiXhosa literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes only.

© Oxford University Press Southern Africa (Pty) Ltd. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press Southern Africa (Pty) Ltd, or as expressly permitted by law, or under terms agreed with the appropriate designated reprographics rights organisation.

Reference

Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR isiXhosa Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus

Size

1,007,321 tokens

Oxford University Press-SADiLaR isiZulu Corpus 1.0

A collection of isiZulu literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes.

© Oxford University Press Southern Africa (Pty) Ltd. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press Southern Africa (Pty) Ltd, or as expressly permitted by law, or under terms agreed with the appropriate designated reprographics rights organisation.

Reference

Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR isiZulu Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus

Size

552,227 tokens

Oxford University Press-SADiLaR Sesotho sa Leboa Corpus 1.0

A collection of Sesotho sa Leboa literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes.

© Oxford University Press Southern Africa (Pty) Ltd. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press Southern Africa (Pty) Ltd, or as expressly permitted by law, or under terms agreed with the appropriate designated reprographics rights organisation.

Reference

Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR Sesotho sa Leboa Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus

Size

197,352 tokens

Oxford University Press-SADiLaR Setswana Corpus 1.0

A collection of Setswana literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes.

© Oxford University Press Southern Africa (Pty) Ltd. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press Southern Africa (Pty) Ltd, or as expressly permitted by law, or under terms agreed with the appropriate designated reprographics rights organisation.

Reference

Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR Setswana Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus

Size

159,091 tokens

Oxford University Press-SADiLaR SiSwati Corpus 1.0

A collection of SiSwati literature works from Oxford University Press, South Africa. This corpus is strictly available for research purposes.

© Oxford University Press Southern Africa (Pty) Ltd. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press Southern Africa (Pty) Ltd, or as expressly permitted by law, or under terms agreed with the appropriate designated reprographics rights organisation.

Reference

Oxford University Press & SADiLaR. 2019. Oxford University Press-SADiLaR SiSwati Corpus 1.0. Potchefstroom: SADiLaR, North-West University. Available: SADiLaR Corpus portal: https://corpus.sadilar.org/corpusportal/explore/corpus

Size

91,129 tokens