Where do these gigantic TMs come from?
Thread poster: Hans Lenting
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Jun 18, 2012

Hi,

I've been reading postings in this forum about TMs as large as 300,000 - 350,000 TUs. I was wondering: where do TMs of this size come from? I imagine they are created by a team of translators over many years? Or are these TMs created by one or two translators, covering several subjects and fields?

Aren't there ways to compact these gigantic TMs by filtering?

Depending on how far you want to go with this, you can remove:

- TUs that only differ in numbers
- TUs that only differ in tags
- TUs that only differ in leading/trailing spaces
- TUs that only differ in trailing punctuation characters
- TUs that contain source and target segments that are identical
- TUs that only differ in case (upper, lower, mixed)
- TUs that only differ in number of internal spaces and/or tabs
etc.
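
As a rough illustration of the filters listed above, here is a minimal Python sketch. It assumes the TUs are already available as plain (source, target) string pairs and that inline tags look like `<...>`. The names `normalize` and `compact` are made up for this example and do not come from any CAT tool.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")            # assumption: inline tags look like <...>
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")    # integers and simple decimals
WS_RE = re.compile(r"\s+")                 # runs of spaces and/or tabs


def normalize(segment: str) -> str:
    """Reduce a segment to a comparison key that ignores the listed differences."""
    text = TAG_RE.sub("", segment)                    # differences in tags
    text = NUM_RE.sub("0", text)                      # differences in numbers
    text = WS_RE.sub(" ", text).strip()               # internal and leading/trailing spaces
    text = text.rstrip(string.punctuation).rstrip()   # trailing punctuation
    return text.casefold()                            # upper/lower/mixed case


def compact(units):
    """Keep the first TU for each normalized (source, target) key."""
    seen = set()
    kept = []
    for source, target in units:
        if source.strip() == target.strip():
            continue                                  # source identical to target
        key = (normalize(source), normalize(target))
        if key in seen:
            continue                                  # differs only in ignored details
        seen.add(key)
        kept.append((source, target))
    return kept
```

How aggressive the key should be is a matter of taste: dropping TUs that differ only in numbers, for instance, only makes sense if the CAT tool substitutes numbers automatically.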

Cheers,

Hans


 
Selcuk Akyuz
Selcuk Akyuz  Identity Verified
Türkiye
Local time: 18:23
English to Turkish
+ ...
I use smaller TMs Jun 18, 2012

Hi Hans,

My Big Mama has 60K segments, but possibly 20K of them could easily be removed. The 300K-segment TM I mentioned in another message was created from Microsoft CSV files; I rarely use it, mostly when testing the limits of a CAT tool.

Many people use the DGT Multilingual Translation Memory. It is not available in my language pair, so I can't say whether it is useful or not. But these TMs are really gigantic, some over 1 or 2 million segments! In any case, I prefer my own smaller TMs.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Testing with EU DGT TM/gigantic TM in CafeTran Jun 18, 2012

Selcuk Akyuz wrote:

Many people use the DGT Multilingual Translation Memory, it is not available in my language pair so I can't say if it is useful or not. But they are really gigantic, some over 1 or 2 million segments! But in any case I prefer my own smaller TMs.


Hello Selcuk,

I've downloaded the DEU-NLD part of the EU DGT TM here:
http://lossner.net/downloads/EU_DE-NL.zip

I ran some tests on my MacBook Pro with 8 GB of RAM, 2048 MB of which was assigned to CafeTran (running in 32-bit mode).

First I changed the language codes from DE-DE to de-de and from NL-NL to nl-nl, saved the file as UTF-8, and clicked Options | Filter Source=Target.
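
For anyone who prefers to script this preparation, here is a minimal Python sketch of the same two steps: lowercasing the language codes and dropping TUs whose source equals the target. It assumes a bilingual TMX with the usual `header`/`body`/`tu`/`tuv`/`seg` layout. The function name `prepare_tmx` and the file names in the comments are hypothetical, and this is not how CafeTran implements its filter.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def prepare_tmx(root):
    """Lowercase language codes and drop TUs whose source equals the target."""
    header = root.find("header")
    if header is not None and "srclang" in header.attrib:
        header.set("srclang", header.get("srclang").lower())

    body = root.find("body")
    if body is None:
        return root

    for tu in list(body.findall("tu")):
        texts = []
        for tuv in tu.findall("tuv"):
            # older TMX files use a plain "lang" attribute instead of xml:lang
            for attr in (XML_LANG, "lang"):
                if attr in tuv.attrib:
                    tuv.set(attr, tuv.get(attr).lower())
            seg = tuv.find("seg")
            texts.append("".join(seg.itertext()).strip() if seg is not None else "")
        if len(texts) == 2 and texts[0] == texts[1]:
            body.remove(tu)                      # source = target
    return root


# Hypothetical usage (file names are placeholders):
# tree = ET.parse("input.tmx")
# prepare_tmx(tree.getroot())
# tree.write("output.tmx", encoding="utf-8", xml_declaration=True)
```

Note that ElementTree drops a DOCTYPE declaration when writing the file back, so check that your tool accepts the result.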

I've loaded the TM in edit mode in 2 or 3 seconds:

http://dl.dropbox.com/u/15919910/MBP/number%20of%20tus.png

(the last TUs come from another test project; they were added to the DGT TM)

I've created a small Word document containing some German sentences from the TM, both in literal form and in a slightly modified form. I've used this test document with the TM:

http://dl.dropbox.com/u/15919910/MBP/98%20percent.png

One tag had to be inserted; recognition took 2 or 3 seconds (which is a little long).

Here I changed some numbers:

http://dl.dropbox.com/u/15919910/MBP/another%2095%20percent.png

Recognition speed was acceptable too.

I guess that recognition speed could be increased by using a DB, but I don't want that overhead (besides, I never use TMs that large).

Cheers,

Hans



[Edited at 2012-06-18 19:45 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 15:23
Member (2009)
Dutch to English
+ ...
very large translation memories (for free) Jun 18, 2012

Hi Hans,

A good place to start looking would be the Opus 'open parallel corpus', which is here:

http://opus.lingfil.uu.se/

A quick look reveals that there are approx. 4.2M segments in German-Dutch, for example, spread across 9 TMXs:

Europarl3
OpenSubtitles2011
EMEA
ECB
KDE4
OpenSubtitles
PHP
EUconst
KDEdoc

Then there is the DGT Multilingual Translation Memory, as Selcuk mentioned. The Dutch-English DGT-TM-2011 has almost 2,000,000 segments. I have it in memoQ, along with zillions of others. However, you are going to need an SSD, a lot of RAM, and a very fast computer for all of this to load quickly in your concordance search window.

Michael


sorry, seems we cross-posted there

[Edited at 2012-06-18 19:47 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Thanks for the reference Jun 18, 2012

Michael Beijer wrote:

sorry, seems we cross-posted there

[Edited at 2012-06-18 19:47 GMT]


Hi Michael,

Never mind. Thanks for the references. I'll have a look at them.

Cheerio,

Hans


 
Selcuk Akyuz
Selcuk Akyuz  Identity Verified
Türkiye
Local time: 18:23
English to Turkish
+ ...
Camera symbol Jun 18, 2012

Hans,

What is that camera symbol (button?) and the slider next to it? Are they in the Mac version only? I always update CT by replacing the files sent by Igor. Perhaps I should use the installer again. But then the license details will change (?)

I cannot change the colours of the lines; yours are orange... but they look as if they were drawn in Paint.


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Snapshot? Jun 19, 2012

Selcuk Akyuz wrote:

Hans,

What is that camera symbol (button?) and the slider next to it? Are they in MAC version only? I always update CT replacing the files sent by Igor. Perhaps I should use the installer again. But then the license details will change (?)

I cannot change the colours of lines, yours orange... but they look like drawn in Paint.



Hi Selcuk,

I'm not sure what the camera symbol is for. This one, Igor will have to explain.

As for the slider, please watch this movie:

http://cafetran.wordpress.com/2012/06/09/polish-polish-cafetran-with-minimalistic-design/

The slider can be used for changing transparency in one of the FX skins.

BTW: The horizontal lines in the glossary pane aren't drawn in Paint: they were introduced in a yet-to-be-released build. Notice the new grid, which allows even more rows to be displayed on the screen.

Cheers,

Hans

[Edited at 2012-06-19 07:05 GMT]


 
Igor Kmitowski
Igor Kmitowski  Identity Verified
Poland
Local time: 16:23
Member (2016)
English to Polish
+ ...
Transparent themes Jun 19, 2012

Hi Selcuk,

When working with scanned documents, I found it convenient to make the CafeTran window partly transparent, so that the image of the document stays visible without the need to minimize the CT window.

The two CafeTran themes (menu Edit | Appearance | Themes | FX white/black) help to achieve that effect. After choosing one of the themes, you will get the camera symbol to refresh the desktop background and the slider to set the transparency.

Hans happened to find another use for the transparent themes: customizing the look of the program. Here, you can replace the desktop background snapshot with your preferred image or pattern (btw. the wooden or grass patterns look nice). Go to the menu Appearance | Themes | Background image to choose your favorite image/pattern file.

Igor

Selcuk Akyuz wrote:

Hans,

What is that camera symbol (button?) and the slider next to it? Are they in MAC version only? I always update CT replacing the files sent by Igor. Perhaps I should use the installer again. But then the license details will change (?)

I cannot change the colours of lines, yours orange... but they look like drawn in Paint.



 
Selcuk Akyuz
Selcuk Akyuz  Identity Verified
Türkiye
Local time: 18:23
English to Turkish
+ ...
snapshot saved? Jun 19, 2012

Thanks Igor,

I see the potential when working with scanned documents. I use two monitors, but this feature may help when I am away.

As for the snapshots, does CT save them to a temp file? Ctrl+V does not work.

Selcuk


 
Igor Kmitowski
Igor Kmitowski  Identity Verified
Poland
Local time: 16:23
Member (2016)
English to Polish
+ ...
Refreshed but not saved Jun 19, 2012

The camera icon only refreshes the snapshot of the current desktop background, so that the transparency in CT shows the actual content. It does not save it. System tools such as Print Screen let you do that.

Igor

Selcuk Akyuz wrote:

Thanks Igor,

I see the potential when working with scanned documents, I use two monitors but this feature may help when I am away.

As for the snapshots, does CT save them to a temp file? Ctrl+V does not work.

Selcuk


 

