The Chinese<>Spanish Parallel Corpus, PaCheS, is part of PaCorES (Parallel Corpora Spanish), an ongoing major project that is collecting bilingual parallel corpora with Spanish as the central language. So far, in addition to this one, the project includes German<>Spanish (www.corpuspages.eu), English<>Spanish (www.corpuspaens.eu) and French<>Spanish (www.corpuspafres.eu) corpora.
The PaCheS corpus consists of original texts in Chinese or Spanish with their translations, as well as Chinese and Spanish translations of texts originally written in a third language. So far PaCheS contains some 50 million tokens, segmented into 2.177.314 bisegments, i.e. sentence-aligned or subsentence-aligned pairs of text chunks.
With this corpus we aim to build a multifunctional and representative language resource for the Chinese / Spanish language pair that can meet the differentiated needs of users and be exploited for multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as the supply of training data to machine translation systems.
The main purpose of the PaCheS corpus is to be a useful and easy-to-use tool for translators and for learners of Chinese or Spanish as a foreign language at intermediate and advanced levels. With this tool they can obtain a multitude of translation suggestions made by humans and presented within examples of real language use.
It currently includes the following collections:
The sentence-aligned bisegments of the collections above have undergone a series of semiautomatic revisions that deleted unpaired segments, bisegments without textual interest, and bisegments that were too short or too long (more than 350 characters in Spanish).
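The length-based part of this revision can be illustrated with a short Python sketch. It is not the project's actual pipeline: the function name `filter_bisegments` and the minimum length `MIN_CHARS` are assumptions made for illustration (the text states only the upper bound of 350 Spanish characters), and the "without textual interest" criterion is not modeled because the text does not define it.

```python
from typing import Iterable, Iterator, Tuple

MAX_ES_CHARS = 350  # upper bound on the Spanish side, as stated in the text
MIN_CHARS = 2       # hypothetical lower bound; the text gives no value


def filter_bisegments(
    pairs: Iterable[Tuple[str, str]],
    max_es_chars: int = MAX_ES_CHARS,
    min_chars: int = MIN_CHARS,
) -> Iterator[Tuple[str, str]]:
    """Yield only the (Chinese, Spanish) pairs that pass the length checks."""
    for zh, es in pairs:
        zh, es = zh.strip(), es.strip()
        # Unpaired segment: one side is empty.
        if not zh or not es:
            continue
        # Too short on either side (threshold is an assumption).
        if len(zh) < min_chars or len(es) < min_chars:
            continue
        # Too long: more than 350 characters in Spanish.
        if len(es) > max_es_chars:
            continue
        yield zh, es


sample = [
    ("你好，世界。", "Hola, mundo."),
    ("", "Sin pareja."),            # unpaired
    ("好", "Sí, claro."),            # too short on the Chinese side
    ("很长的句子。", "a" * 351),      # too long on the Spanish side
]
kept = list(filter_bisegments(sample))
```

With the sample above, only the first pair survives. In practice the thresholds would need tuning per collection.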
As this is an ongoing project, we plan to add new collections of Chinese<>Spanish bilingual texts of diverse origin in the future.
Notice:
If you use PaCheS in your work, please acknowledge it and let us know at corpuspaches@usc.es. This way you contribute to the sustainability of the project.
PaCheS statistics (as of 2024/02)
COLLECTION | CHINESE CHARACTERS* | SPANISH TOKENS | BISEGMENTS |
United Nations | 24.031.462 | 9.103.395 | 1.219.488 |
Paracrawl | 15.235.788 | 7.534.822 | 498.145 |
Wikimatrix | 12.546.473 | 6.028.866 | 360.968 |
Ted-Talks | 3.695.097 | 1.740.554 | 98.713 |
Total | 55.508.820 | 24.407.637 | 2.177.314 |
*For Chinese, counts are given in characters rather than tokens, since one character corresponds more closely to one token; the ratio of Chinese characters to Spanish tokens ranges from 2 to 3.
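The totals and the character-to-token ratio mentioned in the note can be checked directly against the figures in the table. The following Python sketch does so; the dictionary layout and variable names are illustrative assumptions, while the numbers are copied from the table.

```python
# Per collection: (Chinese characters, Spanish tokens, bisegments),
# copied from the statistics table.
stats = {
    "United Nations": (24_031_462, 9_103_395, 1_219_488),
    "Paracrawl": (15_235_788, 7_534_822, 498_145),
    "Wikimatrix": (12_546_473, 6_028_866, 360_968),
    "Ted-Talks": (3_695_097, 1_740_554, 98_713),
}

# Column-wise totals across all collections.
total_zh = sum(zh for zh, _, _ in stats.values())
total_es = sum(es for _, es, _ in stats.values())
total_bi = sum(bi for _, _, bi in stats.values())

# Ratio of Chinese characters to Spanish tokens per collection.
ratios = {name: zh / es for name, (zh, es, _) in stats.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} Chinese characters per Spanish token")
print(f"Total: {total_zh} characters, {total_es} tokens, {total_bi} bisegments")
```

The computed sums match the Total row, and every per-collection ratio falls between 2 and 3, consistent with the note.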