Amazing new feature: Total Recall is back!
Thread poster: 2nl (X)
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 04:12
Oct 11, 2014



Total Recall is here for you and me!

Demonstration with a database with 4673277 TUs from the EU

Loading the database to RAM: http://youtu.be/nd4jj0Ue_EM

Using the gigantic database for auto-assembling: http://youtu.be/WwlmW6ys-VQ

See also: http://cafetran.wikidot.com/total-recall


 
MikeTrans
MikeTrans
Germany
Local time: 04:12
Italian to German
+ ...
Do I understand this well? Nov 8, 2014

Hello,

I already have an older version of CafeTran, but I don't use it when huge databases are involved. Otherwise I'm deeply impressed by this translation environment.

So, Total Recall will extract only up to 100 segments (or a user-defined number) from a huge database for any expressions found in your project that are not stopwords, collect them, and put them in a project database, essentially creating a much smaller and more meaningful reference database.
Is this correct?

I don't think this feature is present in CafeTran Espresso 2012, or is it?

I will seriously consider an upgrade if the scenario above is true.

Greets,
Mike

[Edited at 2014-11-08 21:53 GMT]


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
Almost correct, Mike Nov 8, 2014

MikeTrans wrote:
So, Total Recall will extract only up to 100 segments (or a user-defined number) from a huge database for any expressions found in your project that are not stopwords, collect them, and put them in a project database.


CafeTran will extract ALL relevant segments, words or phrases from an H2 database and save them in a TMX file. It will "only" go for a maximum of 100 hits (the default) so as not to "overheat" the search process. In other words, if there are more than 100 occurrences of a word or phrase in the database, CT will call it a day and stop searching for that word/phrase.

In short:

- Let CT convert a resource (TMX or TXT) to an H2 database, or use an existing database
- CT will automatically index* the database
- Start your project, and select Recall segments to memory in Menu | Total Recall
- CT will automatically create a new TMX memory for those segments, and show it in the tabbed pane
- Start translating. CT will use the newly created TMX file like any other TMX file for Auto-Assembly, Auto-Complete, and all other features, while you can still search the indexed H2 database in no time for terms and phrases that haven't been extracted because they didn't match the Project exactly (no fuzziness)

* May take a while
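
For readers who like to see the idea in code, here is a minimal sketch of the recall step described above. It is not CafeTran's implementation: SQLite stands in for the H2 database, the stopword list is made up, the function names (`build_db`, `project_words`, `recall`) are hypothetical, and matching is a plain case-insensitive substring test with no fuzziness. The cap of 100 hits per word mirrors the default mentioned above.

```python
import re
import sqlite3

# Assumptions for illustration only: a tiny stopword list and the default cap of 100.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
MAX_HITS = 100


def build_db(units):
    """Load (source, target) pairs into an in-memory SQLite table (stand-in for H2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tu (id INTEGER PRIMARY KEY, source TEXT, target TEXT)")
    conn.executemany("INSERT INTO tu (source, target) VALUES (?, ?)", units)
    return conn


def project_words(segments):
    """Return the set of lowercased project words that are not stopwords."""
    words = set()
    for segment in segments:
        for word in re.findall(r"\w+", segment.lower()):
            if word not in STOPWORDS:
                words.add(word)
    return words


def recall(conn, segments, max_hits=MAX_HITS):
    """Collect TUs whose source text contains a project word, at most max_hits per word."""
    recalled = {}
    for word in sorted(project_words(segments)):
        rows = conn.execute(
            "SELECT id, source, target FROM tu "
            "WHERE instr(lower(source), ?) > 0 LIMIT ?",
            (word, max_hits),
        )
        for tu_id, source, target in rows:
            recalled[tu_id] = (source, target)  # the dict removes duplicate TUs
    return [recalled[key] for key in sorted(recalled)]
```

The returned list plays the role of the small project TMX: everything downstream works on that subset instead of on the full database.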

Cheers,

Hans

[Edited at 2014-11-08 22:49 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 03:12
Member (2009)
Dutch to English
+ ...
@Hans + Mike: Nov 8, 2014

Meta Arkadia wrote:

MikeTrans wrote:
So, Total Recall will extract only up to 100 segments (or a user-defined number) from a huge database for any expressions found in your project that are not stopwords, collect them, and put them in a project database.


CafeTran will extract ALL relevant segments, words or phrases from an H2 database and save them in a TMX file. It will "only" go for a maximum of 100 hits (the default) so as not to "overheat" the search process. In other words, if there are more than 100 occurrences of a word or phrase in the database, CT will call it a day and stop searching for that word/phrase.

In short:

- Let CT convert a resource (TMX or TXT) to an H2 database, or use an existing database
- CT will automatically index* the database
- Start your project, and select Recall segments to memory in Menu | Total Recall
- CT will automatically create a new TMX memory for those segments, and show it in the tabbed pane
- Run: Translation > Pretranslate all segments
- Start translating. CT will use the newly created TMX file like any other TMX file for Auto-Assembly, Auto-Complete, and all other features, while you can still search the indexed H2 database in no time for terms and phrases that haven't been extracted because they didn't match the Project exactly (no fuzziness)

* May take a while

Cheers,

Hans

[Edited at 2014-11-08 22:49 GMT]


@Hans: As far as I understand it (this is what Igor told me), you should add the step I added to your list. This will speed up matching/AA from the newly created Total Recall TMX during translation.

@Mike: Indeed, the problem of CT being unable to work with very large databases is now over. I am currently still testing this very new feature, but so far it looks like Total Recall has made CT better at handling large datasets than memoQ, which was previously my favourite CAT tool for dealing with Big data. My Total Recall db now contains around 2 million TUs, and it looks like it can handle a lot more than that. It does help to have a decent computer though. I have 32GB of RAM, a Haswell i7, 2 SSDs, etc. and I am getting very good results in terms of import, indexing & "Total Recall pretranslation" times.

I'd suggest getting the latest version from Igor. You can see all the new stuff here: http://cafetranhelp.com/changelog :)

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
Optional Nov 8, 2014

Michael Beijer wrote:
@Hans: As far as I understand it (this is what Igor told me), you should add the step I added to your list. This will speed up matching/AA from the newly created Total Recall TMX during translation.

It depends on the size of the resulting TMX file, and the (average) length of the segments in the Project, I'd say. I'd start translating, and if I notice a serious delay when CT searches for matches in the generated TMX file, I'd go for pretranslation and give it a head-start of a minute or two. No delay, no pretranslation.

Cheers,

Hans


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
It wasn't that bad before Total Recall, actually Nov 8, 2014

Michael Beijer wrote:
Indeed, the problem of CT being unable to work with very large databases is now over

Well, I think CafeTran handled large resources pretty well even before "Total Recall."

A 2-million-segment TMX file was - and still is - loaded in RAM in less than two minutes, and search times were acceptable (for me anyway). Before Total Recall, you could also load resources (TMX and TXT) into the H2 database in about the same time, and it's a once-only process. Searching the database was slower than searching the TMX file of course, but still "doable." I don't think other CAT tools have even reached that level yet. Now, the H2 database is indexed, and manual searches are blisteringly fast. To the point that I use H2 databases for manual search only, and forget about the Recall feature.
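
To illustrate why indexing makes such a difference, here is a small sketch comparing a linear scan with a lookup in a word index. How CafeTran actually indexes its H2 database isn't described in this thread, so this is only an assumption-laden toy: a plain in-memory inverted index in Python, with hypothetical function names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split a segment into lowercased words."""
    return re.findall(r"\w+", text.lower())


def build_index(segments):
    """Map each word to the set of segment positions it occurs in (built once)."""
    index = defaultdict(set)
    for position, segment in enumerate(segments):
        for word in tokenize(segment):
            index[word].add(position)
    return index


def scan_search(segments, word):
    """Unindexed search: every segment is tokenized and checked on every query."""
    word = word.lower()
    return [i for i, segment in enumerate(segments) if word in tokenize(segment)]


def indexed_search(index, word):
    """Indexed search: a single dictionary lookup, independent of corpus size."""
    return sorted(index.get(word.lower(), ()))
```

Building the index costs time up front (the "may take a while" part), but after that each query only touches the entries for the word being searched.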

Cheers,

Hans


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
Nope Nov 9, 2014

MikeTrans wrote:
I don't think this feature is present in CafeTran Espresso 2012, or is it?

It isn't, although the H2 database functionality has been there forever. The new features are:
- Indexing the H2 database, which makes searching it very fast, of course
- "Total Recall" that searches the H2 database for any matches for the project and shows the results in a TMX file

And heaps of other new features, of course.

I will seriously consider an upgrade if the scenario above is true.


It's true. But try first, upgrade later. Just download the latest version. The Terms and Conditions have been changed lately, but I think it's fully functional now for 30 days.

Cheers,

Hans


 
MikeTrans
MikeTrans
Germany
Local time: 04:12
Italian to German
+ ...
Thanks Nov 9, 2014

Hello Hans, hello Michael,

thanks for taking the time. That's all very clear and good news indeed.
My current laptop is an i5 with 8 GB RAM, but I will soon get a desktop i7, 16 GB.
I surely will take a look at the new features of the current version.

Thanks and cheers,
Mike

...should watch this movie again some time...

[Edited at 2014-11-09 00:30 GMT]


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
More than adequate Nov 9, 2014

MikeTrans wrote:
My current laptop is an i5 with 8 GB RAM, but I will soon get a desktop i7, 16 GB.
I surely will take a look at the new features of the current version.


I use a four-year-old iMac (but the five-year-old model), with a regular HDD. Okay, I added 8 GB of RAM for a total of 12 GB, but RAM isn't very important for H2/Total Recall (it is for loading huge TMX files as TMX files, though).
So if I mention things like "less than two minutes," it's a figure based on my experience on my iMac. Your laptop should be more than good enough.

Cheers,

Hans


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
TM vs Total Recall Nov 9, 2014

Meta Arkadia wrote:
Well, I think CafeTran handled large resources pretty well even before "Total Recall."*


I recorded a screencast of the search for two words in the German-Dutch DGT, attached to the project as a TMX memory, and as an indexed H2 database.

http://www.screencast.com/t/95NPzHRJJIpn

Don't worry, it's only 26 seconds short.

The search in the TMX file shows how things were before the upgrade. Clicking the icon with the coffee cup marked MS (Memory Search) activates the search in the TMX file, so the "old" way. Clicking the icon marked DS starts the search in the indexed H2 database. A third search - in a newly created "Total Recall" TMX file - would undoubtedly be faster, but I don't think mere mortals will notice the increase in speed.

Relevant technical data:
- DGT GER-DUT, 2 million segments
- Computer: iMac 27", late October 2009 model
- Processor: 3.06 GHz Intel Core 2 Duo
- RAM: 12 GB 1067 MHz DDR3, 8 GB assigned to Java
- Storage: 1 TB rotational HDD

Cheers,

Hans

*Theorem: "People who quote themselves don't have enough work to do." True or False?
True.


 
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 04:12
TOPIC STARTER
Very helpful! Nov 9, 2014

Meta Arkadia wrote:

I recorded a screencast of the search for two words in the German-Dutch DGT, attached to the project as a TMX memory, and as an indexed H2 database.

http://www.screencast.com/t/95NPzHRJJIpn


Very helpful indeed. Thanks very much!


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 03:12
Member (2009)
Dutch to English
+ ...
… the gateway to a more future-ready CT! Nov 9, 2014

One problem though, if you can call it that, is that 2 million TUs is the absolute minimum of what I would call a big TMX. For example, I want to be able to search my CELEX collection, which is much bigger (all together, it weighs in at 5.34GB!):

some_text

Then there is the DGT-TM TM (the ones that can be downloaded from the EU site):

01. DGT-TM-2007.tmx
01. DGT-TM-2011.tmx
01. DGT-TM-2012.tmx
01. DGT-TM-2013.tmx

+

European Parliament reports_part1.tmx
European Parliament reports_part2.tmx

+

EP plenary session transcripts_part1.tmx
EP plenary session transcripts_part2.tmx
EP plenary session transcripts_part3.tmx
EP plenary session transcripts_part4.tmx

+

Ubuntu (100,000 TUs).tmx
Gnome (90,000 TUs).tmx

+

some_text

and the list goes on.

I know people scoff at "Big Data", but I figure: if it's available, why not give it a whirl.

I'm currently trying to stuff all of it into a single CafeTran H2/Total Recall db

So far I have gotten mixed results.

Also, I have no idea how TMLookup manages to keep its dbs so damned small compared to CT's H2 implementation. E.g., my current H2 db in CT is 35Gb and I only imported a single TMX of 500,000 TUs. I am still waiting for it to "compact". Contrast this with my TMLookup db, which currently contains 40,000,000 TUs (that's 40 million!) and searches are several seconds faster than anything I can achieve with CT's H2 dbs.

Anyway, Total Recall is still very new and will be needing a bit of testing. However, it is definitely the gateway to a more future-ready CT!

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 09:12
English to Indonesian
+ ...
Big Data Nov 9, 2014

Michael Beijer wrote:
I know people scoff at "Big Data"


I do. That is, there's nothing wrong with Big Data if they are well organised, very searchable, and if both the application and the user can handle them in the most efficient way. I don't think that's the case at the moment. I only connect to the 2 million segments 2011 DGT in CafeTran if I do an EU job. If I can't find a word or a phrase in that DB, I may search other DGT resources, but entering the term in SpotLight (OS X) is probably faster and better than importing them in CT. But how often does that happen? And that's search only. If you go for Auto-Assemble (TMX), the algorithm used by CafeTran (and other CAT tools) may spoil the results because of the sheer size of the data. Using the DB as such for AA isn't even possible (yet). And if you use the Recall function to arrive at a TMX that does allow for AA, you miss out on all fuzziness at the DB>TMX phase. No SQL (yet).

In short, I don't suggest you trash all the files you mention above, but I do suggest not to use them as one huge resource in a CAT tool.
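
To make the "no fuzziness" point concrete, here is a small sketch of what an exact-match extraction misses compared with a fuzzy one. CafeTran's own matching algorithm isn't described in this thread; as a stand-in, this uses `difflib.SequenceMatcher` from the Python standard library, and the 0.8 threshold and function names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two segments, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_matches(segment, memory):
    """Only segments identical to the project segment are kept."""
    return [candidate for candidate in memory if candidate == segment]


def fuzzy_matches(segment, memory, threshold=0.8):
    """Segments that are merely similar are kept too (threshold is an assumption)."""
    return [c for c in memory if similarity(segment, c) >= threshold]


if __name__ == "__main__":
    memory = [
        "The Commission shall adopt the measures.",
        "The Commission shall adopt those measures.",
        "Entry into force",
    ]
    segment = "The Commission shall adopt these measures."
    print(exact_matches(segment, memory))
    print(fuzzy_matches(segment, memory))
```

With the sample data above, the exact search finds nothing, while the fuzzy search keeps the two near-identical sentences. Those are the kind of hits that get lost when the extraction from database to TMX only accepts exact matches.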

Cheers,

Hans


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 03:12
Member (2009)
Dutch to English
+ ...
Preliminary TMX import speed results… Nov 9, 2014

Importing TMXs into my db:

TMX import speeds:

1,000 TUs = 4 seconds
10,000 TUs = 40 seconds (= 0.67 minutes)
100,000 TUs = 400 seconds (= 6.67 minutes)
500,000 TUs = 2,000 seconds (= 33.33 minutes)

Not bad!
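
The figures above work out to a constant rate, which a few lines of Python make explicit. The extrapolation to 2 million TUs is only a back-of-the-envelope estimate that assumes the import keeps scaling linearly, which may well not hold for a much larger database; the function names are made up for this sketch.

```python
# (TUs imported, seconds taken), copied from the measurements above.
MEASUREMENTS = [(1_000, 4), (10_000, 40), (100_000, 400), (500_000, 2_000)]


def throughput(tus, seconds):
    """Import rate in TUs per second."""
    return tus / seconds


def estimate_minutes(tus, rate):
    """Estimated import time in minutes, assuming the rate stays constant."""
    return tus / rate / 60


if __name__ == "__main__":
    rates = [throughput(tus, seconds) for tus, seconds in MEASUREMENTS]
    print(rates)  # every measurement gives the same rate
    print(round(estimate_minutes(2_000_000, rates[0]), 1), "minutes for 2 million TUs")
```

So each measurement corresponds to 250 TUs per second, and at that rate a 2-million-TU import would take a bit over two hours.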

Michael

-----------------------------------------------------------------------------------------*
Relevant technical data:
- CELEX TMXs (courtesy of András Farkas)
- Computer: Dell Precision M6800 (laptop), 2014
- Processor: 2.70 GHz Intel i7 Haswell
- RAM: 32 GB 1866MHz DDR3, 10 GB assigned to CT
- Storage: H2 db stored on hybrid 1TB disk. CT installed on SSD