About PaCheS

The Chinese<>Spanish Parallel Corpus, PaCheS, is part of a major ongoing project, PaCorES (Parallel Corpora Spanish), which is collecting bilingual parallel corpora with Spanish as the central language. So far, in addition to this one, the project includes German<>Spanish (www.corpuspages.eu), English<>Spanish (www.corpuspaens.eu) and French<>Spanish (www.corpuspafres.eu) corpora.

PaCheS comprises original texts in Chinese or Spanish together with their translations, as well as Chinese and Spanish translations of texts originally written in a third language. So far, PaCheS contains some 50 million tokens, segmented into 2.177.314 bisegments, i.e. pairs of text chunks aligned at the sentence or subsentence level.

With this corpus we aim to build a multifunctional and representative language resource for the Chinese / Spanish language pair that can meet the differing needs of users and can be exploited for multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as supplying training data for machine translation systems.

The main purpose of PaCheS is to be a useful and easy-to-use tool for translators and for learners of Chinese or Spanish as a foreign language at intermediate and advanced levels. With this tool they can obtain a multitude of translation suggestions made by humans and presented within examples of real language use.

It currently includes the following collections:

United Nations Parallel Corpus¹ v1.0 is composed of official records and other parliamentary documents of the United Nations in Chinese and Spanish that are in the public domain. The current version of the corpus comprises content that was produced and manually translated between 1990 and 2014, including sentence-level alignments. PaCheS includes only a portion of it, covering more than 1.2 million bisegments.
Wikimatrix² is a corpus of parallel sentences extracted from Wikipedia articles. Chinese texts originally written in traditional characters were converted to simplified characters.
ParaCrawl³ v9 is a corpus formed by automatically extracting texts from multilingual websites and then aligning them at the sentence level.
Ted-Talks⁴ is a corpus that collects the Chinese and Spanish translations of the transcripts of 1369 Ted-Talks from the years 2018 to 2020.

The sentence-aligned bisegments of the collections above have undergone a series of semiautomatic revisions that deleted unpaired segments, bisegments without textual interest, and bisegments that were too short or too long (more than 350 characters in Spanish).
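The cleaning steps described above could be sketched roughly as follows. This is only an illustration, not the project's actual pipeline: the function names, the minimum length `MIN_ES_CHARS`, and the "textual interest" heuristic (a segment must contain at least one letter or Chinese character) are assumptions introduced here. Only the 350-character upper limit for Spanish comes from the description.

```python
from typing import Iterable, Iterator, Tuple

MAX_ES_CHARS = 350  # upper limit for the Spanish side, as stated in the text
MIN_ES_CHARS = 10   # hypothetical lower limit; the real threshold is not stated


def has_text(segment: str) -> bool:
    """Assumed heuristic for 'textual interest': at least one letter.

    str.isalpha() is also True for Chinese characters, so the same
    check works for both sides of a bisegment.
    """
    return any(ch.isalpha() for ch in segment)


def filter_bisegments(
    pairs: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Yield only the (Chinese, Spanish) pairs that pass all three checks."""
    for zh, es in pairs:
        zh, es = zh.strip(), es.strip()
        if not zh or not es:
            continue  # unpaired segment: one side is empty
        if not (has_text(zh) and has_text(es)):
            continue  # no textual interest (digits or punctuation only)
        if not (MIN_ES_CHARS <= len(es) <= MAX_ES_CHARS):
            continue  # Spanish side too short or too long
        yield zh, es


if __name__ == "__main__":
    sample = [
        ("你好，世界。", "Hola, mundo."),
        ("", "Sin pareja en chino."),
        ("12345", "12345 67890 12345"),
        ("是。", "Sí."),
        ("长" * 10, "a" * 351),
    ]
    for zh, es in filter_bisegments(sample):
        print(zh, "|", es)
```

In this sketch only the first sample pair survives; the others are dropped as unpaired, without text, too short, and too long, respectively.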

As this is an ongoing project, we plan to add new collections of Chinese<>Spanish bilingual texts from diverse sources in the future.

Notice:

If you use PaCheS in your work, please acknowledge it and let us know at corpuspaches@usc.es. In this way you contribute to the sustainability of the project.

PaCheS statistics (as of 2024/02)

COLLECTION	LANGUAGE	CHARACTERS* / TOKENS**	BISEGMENTS
United Nations	Chinese	24.031.462	1.219.488
United Nations	Spanish	9.103.395	1.219.488
ParaCrawl	Chinese	15.235.788	498.145
ParaCrawl	Spanish	7.534.822	498.145
Wikimatrix	Chinese	12.546.473	360.968
Wikimatrix	Spanish	6.028.866	360.968
Ted-Talks	Chinese	3.695.097	98.713
Ted-Talks	Spanish	1.740.554	98.713
Total	Chinese	55.508.820	2.177.314
Total	Spanish	24.407.637	2.177.314

*For Chinese, counts are given in characters, since one character corresponds more closely to one token; there are between 2 and 3 Chinese characters per Spanish token.
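The character-to-token ratio mentioned in the note can be checked against the figures in the table. The short sketch below copies those figures and computes the ratio per collection and for the whole corpus; the variable names are illustrative only.

```python
# (Chinese characters, Spanish tokens) per collection, copied from the table
STATS = {
    "United Nations": (24_031_462, 9_103_395),
    "ParaCrawl": (15_235_788, 7_534_822),
    "Wikimatrix": (12_546_473, 6_028_866),
    "Ted-Talks": (3_695_097, 1_740_554),
}

# Chinese characters per Spanish token, by collection
ratios = {name: zh / es for name, (zh, es) in STATS.items()}

# Corpus-wide totals and overall ratio
total_zh = sum(zh for zh, _ in STATS.values())
total_es = sum(es for _, es in STATS.values())
total_ratio = total_zh / total_es

if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.2f}")
    print(f"Total: {total_ratio:.2f}")
```

The per-collection ratios run from about 2.0 to about 2.6, which is consistent with the range given in the note, and the computed totals match the Total rows of the table.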