
About PaCheS


The Chinese<>Spanish Parallel Corpus, PaCheS, is part of a larger ongoing project, PaCorES (Parallel Corpora Spanish), which is collecting bilingual parallel corpora with Spanish as the central language. So far, in addition to this one, the project includes German<>Spanish (www.corpuspages.eu), English<>Spanish (www.corpuspaens.eu) and French<>Spanish (www.corpuspafres.eu) corpora.

PaCheS comprises original texts in Chinese or Spanish together with their translations, as well as Chinese and Spanish translations of texts originally written in a third language. So far, PaCheS contains some 50 million tokens, segmented into 2,177,314 bisegments, i.e. pairs of text chunks aligned at the sentence or subsentence level.

With this corpus we aim to build a multifunctional and representative language resource for the Chinese/Spanish language pair that can meet the differing needs of its users and can be exploited for multiple purposes, such as general research in contrastive linguistics, linguistic typology, translation studies and bilingual lexicography, as well as the supply of training data for machine translation systems.

The main purpose of PaCheS is to be a useful and easy-to-use tool for translators and for learners of Chinese or Spanish as a foreign language at intermediate and advanced levels. With this tool they can obtain a multitude of translation suggestions made by humans and presented within examples of real language use.

It currently includes the following collections:

  1. United Nations Parallel Corpus [1] v1.0, composed of official records and other parliamentary documents of the United Nations in Chinese and Spanish that are in the public domain. The current version of the corpus comprises content that was produced and manually translated between 1990 and 2014, including sentence-level alignments. PaCheS includes only a portion of it, covering more than 1.2 million bisegments.
  2. Wikimatrix [2], a corpus of parallel sentences extracted from the content of Wikipedia articles. Chinese texts originally written in traditional characters were converted to simplified characters.
  3. ParaCrawl [3] v9, a corpus formed through the automatic extraction of texts from multilingual websites, which are subsequently aligned at the sentence level.
  4. Ted-Talks [4], a corpus that collects translations into Chinese and Spanish of the transcriptions of 1,369 TED Talks from the years 2018 to 2020.

The sentence-aligned bisegments of the collections above have undergone a series of semiautomatic revisions that removed unpaired segments, bisegments without textual interest, and bisegments that were too short or too long (more than 350 characters in Spanish).
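As an illustration only, the kind of filtering described above could be sketched in Python as follows. This is not the project's actual pipeline: the `Bisegment` class, the function names, the lower bound `MIN_ES_CHARS` and the "no letters in the Spanish side" test for segments without textual interest are all assumptions made for the example. Only the 350-character upper limit for Spanish comes from the description above.

```python
from dataclasses import dataclass

MAX_ES_CHARS = 350  # upper limit stated in the text (Spanish side)
MIN_ES_CHARS = 10   # hypothetical lower limit; the text gives no figure


@dataclass(frozen=True)
class Bisegment:
    """One aligned pair of text chunks (hypothetical structure)."""
    zh: str
    es: str


def keep(pair: Bisegment,
         min_es: int = MIN_ES_CHARS,
         max_es: int = MAX_ES_CHARS) -> bool:
    """Return True if the pair passes all three illustrative checks."""
    zh, es = pair.zh.strip(), pair.es.strip()
    # 1. Unpaired segments: one side is empty.
    if not zh or not es:
        return False
    # 2. Assumed stand-in for "without textual interest":
    #    the Spanish side contains no letters at all.
    if not any(ch.isalpha() for ch in es):
        return False
    # 3. Too short or too long, measured in Spanish characters.
    return min_es <= len(es) <= max_es


def filter_bisegments(pairs: list[Bisegment]) -> list[Bisegment]:
    """Keep only the pairs that pass `keep`, preserving their order."""
    return [p for p in pairs if keep(p)]
```

In practice the revisions are described as semiautomatic, so a filter of this kind would only be a first pass before manual checking.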

As this is an ongoing project, we plan to add new collections of Chinese<>Spanish bilingual texts of diverse origin in the future.

Notice:

If you use PaCheS in your work, please cite it and let us know at corpuspaches@usc.es. In this way you contribute to the sustainability of the project.

PaCheS Statistics (as of February 2024)

COLLECTION       LANGUAGE   CHARACTERS*   TOKENS       BISEGMENTS
United Nations   Chinese    24,031,462                 1,219,488
                 Spanish                  9,103,395
Paracrawl        Chinese    15,235,788                   498,145
                 Spanish                  7,534,822
Wikimatrix       Chinese    12,546,473                   360,968
                 Spanish                  6,028,866
Ted-Talks        Chinese     3,695,097                    98,713
                 Spanish                  1,740,554
Total            Chinese    55,508,820                 2,177,314
                 Spanish                 24,407,637

*For Chinese, counts are given in characters, since one character corresponds more closely to one token. The ratio of Chinese characters to Spanish tokens ranges from 2 to 3.
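The ratio mentioned in the note can be checked directly against the figures in the table. The short Python sketch below does so; the variable and function names are illustrative, and the numbers are copied from the table above.

```python
# (Chinese characters, Spanish tokens, bisegments) per collection,
# copied from the statistics table.
STATS = {
    "United Nations": (24_031_462, 9_103_395, 1_219_488),
    "Paracrawl": (15_235_788, 7_534_822, 498_145),
    "Wikimatrix": (12_546_473, 6_028_866, 360_968),
    "Ted-Talks": (3_695_097, 1_740_554, 98_713),
}


def char_token_ratio(zh_chars: int, es_tokens: int) -> float:
    """Chinese characters per Spanish token."""
    return zh_chars / es_tokens


def totals(stats: dict) -> tuple[int, int, int]:
    """Sum characters, tokens and bisegments over all collections."""
    zh = sum(row[0] for row in stats.values())
    es = sum(row[1] for row in stats.values())
    bi = sum(row[2] for row in stats.values())
    return zh, es, bi


if __name__ == "__main__":
    for name, (zh, es, _) in STATS.items():
        print(f"{name}: {char_token_ratio(zh, es):.2f}")
    zh_total, es_total, _ = totals(STATS)
    print(f"Total: {char_token_ratio(zh_total, es_total):.2f}")
```

Every per-collection ratio computed this way falls between 2 and 3, and the column sums match the Total row of the table.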

                                                    
PaCheS Version 1.0
Last updated: 15.04.2024
ISLRN 153-041-143-772-3   ©PaCorES
Creative Commons License
University of Santiago de Compostela
This project is funded by the State Research Agency (AEI) of the Spanish Ministry of Science, Innovation and Universities (PID2021-125313OB-I00).