tmx from Parallel corpus of Patent Translation Resource?
Thread poster: Noe Tessmann
Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 23:46
English to German
+ ...
TOPIC STARTER
Import took several days Jan 2, 2015

Dear Michael,

thanks a lot for uploading, it finally worked. I imported the file, but it literally took days as a background task. I have to wait until my broken Asus is back. Working on a low-performance PC is a pain in ...

Good times ahead

Noe


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:46
Member (2009)
Dutch to English
+ ...
@Roberto: Jan 2, 2015

Robert Bononno wrote:

I have the source files in FR and EN but don't believe I have any software or text editor that can manipulate and join the larger files (2.7 GB, 3+ GB). One of my text editors, TextWrangler, refuses to open them. TextEdit will open the smaller ones, but I haven't tried the larger files. I'm very reluctant to try to manipulate these in Excel; it's going to generate a humongous file. I have 8 GB RAM on the machine, but these are big files. Might be easier to simply search the contents of the corpus online (if possible).


Macs aren't great at handling very large text files, which is one of my reasons for sticking with Windows. You might want to ask over on the CafeTran mailing list, as several people use Macs there and are quite knowledgeable when it comes to this stuff.

I don't think anyone has added these files to an online database yet. However, they might pop up on the Opus site one of these days, which I recommend you have a look at every now and again: http://opus.lingfil.uu.se/ (the site also has a rudimentary online search interface)

Michael

https://groups.google.com/forum/#!forum/cafetranslators

[Edited at 2015-01-02 21:29 GMT]


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:46
Member (2009)
Dutch to English
+ ...
@Noe: Jan 2, 2015

Noe Tessmann wrote:

Dear Michael,

thanks a lot for uploading, it finally worked. I imported the file, but it literally took days as a background task. I have to wait until my broken Asus is back. Working on a low-performance PC is a pain in ...

Good times ahead

Noe



Cool, that's good to hear.

Incidentally, are you actually interested in the metadata? If not, it would simplify the process of converting the data to a TMX somewhat. No big deal if you want it though; it's just an extra step or two.

I will most likely be doing the other folders ("title", "description" and "claims") sometime this weekend.

Michael


 
Meta Arkadia
Meta Arkadia
Local time: 05:46
English to Indonesian
+ ...
Mac Jan 2, 2015

Robert Bononno wrote:
One of my text editors, TextWrangler, refuses to open them.

Like for its big brother BBEdit, the maximum text file size for TextWrangler is limited to 384 MB, the text file size limit for OS X. To open and edit large text files, you'll have to use either a Java app and assign enough RAM to the heap or a Unix app, or split the files into ones less than 384 MB (the trick Michael's EmEditor uses). You can do the splitting in the Terminal.
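For anyone who would rather script the splitting than do it by hand, here is a minimal Python sketch of the idea. It is only an illustration, not what EmEditor or any other tool mentioned here actually does: the function name `split_text_file` and the part-naming scheme are made up for this example, and the size limit is simply whatever you pass in as `max_bytes`.

```python
from pathlib import Path


def split_text_file(src, out_dir, max_bytes):
    """Split src into numbered parts of at most max_bytes each.

    Breaks only at line ends, so no line is cut in half. A single line
    longer than max_bytes still gets a part of its own, which will then
    exceed the limit. (Hypothetical helper, for illustration only.)
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    parts = []    # paths of the part files, in order
    out = None    # currently open part file
    written = 0   # bytes written to the current part

    # Binary mode: the data is copied byte for byte, whatever the encoding.
    with src.open("rb") as fh:
        for line in fh:
            # Start a new part at the very beginning, or when this line
            # would push a non-empty part over the limit.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part = out_dir / f"{src.stem}.part{len(parts) + 1:03d}{src.suffix}"
                parts.append(part)
                out = part.open("wb")
                written = 0
            out.write(line)
            written += len(line)

    if out is not None:
        out.close()
    return parts
```

Calling it with, say, `max_bytes=300 * 1024 * 1024` would keep every part comfortably below the limit mentioned above. Because the file is read one line at a time, memory use stays small even for multi-gigabyte inputs.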

Cheers,

Hans


 
Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 23:46
English to German
+ ...
TOPIC STARTER
Anyone already aligned the other parts (description, ...)? Feb 23, 2015

Hi,

so my laptop has finally been fixed. Has anyone (Michael, you're the master of alignment) already aligned the other parts of this patent corpus? The abstracts are already really helpful.

Kind regards and a nice new week

Noe


This corpus site doesn't seem to be online.


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:46
Member (2009)
Dutch to English
+ ...
metadata too? Feb 23, 2015

Hi Noe,

No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align.

Michael


 
Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 23:46
English to German
+ ...
TOPIC STARTER
metadata are not so important. Feb 23, 2015

Dearest Michael,


I think metadata are not so important. I don't need to know where exactly the translation comes from.
Whenever you have time. It's not urgent. I am fine with the abstracts part you kindly aligned.

All the best

Noe



Michael Beijer wrote:

Hi Noe,

No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align.

Michael


 
Jean Lachaud
Jean Lachaud  Identity Verified
United States
Local time: 18:46
English to French
+ ...
FR/EN and EN/FR too, please Feb 23, 2015

Michael:

I am interested in the FR/EN and EN/FR versions, too. Or maybe a more detailed description of the workflow (I'm a Windows user).

Thanks in advance.

JL

Michael Beijer wrote:

Hi Noe,

No, I never got around to it. I can do it in the next day or so. Do you want/need the metadata? If not, it would be faster/easier to align.

Michael


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:46
Member (2009)
Dutch to English
+ ...
Phew! (PatTR: Patent Translation Resource files converted to TMXs) Feb 24, 2015

Wow, these files are very, very big.

OK, so I managed to do the first part of the Claims batch. Claims is so big, I will have to split it up into around 11 batches of 1,000,000 TUs each. That is, it will be spread across 11 TMXs.

Claims #1 = here: (1)-PatTR-CLAIMS-(de-en)(TUs-1-1,000,000).tmx (185 MB)
Claims #2 = here: (2)-PatTR-CLAIMS-(de-en)(TUs-1,000,000-2,000,000).tmx (175.48 MB)

I re-uploaded the TMX derived from "Abstract" here: PatTR-ABSTRACT-(de-en)(718,201-TUs).tmx (134.01 MB)

TMXs for Claims #2-11 will follow as soon as I have a moment, and FR↔EN after that!

Here, in a nutshell, is my workflow:

• append .txt to file names
• open files in EmEditor (or a good text editor capable of opening large files; UltraEdit is also good)
• split these .txt files into manageable chunks (of 1 million TUs/lines each)
• in Ron's CSV Editor, create an empty file and paste in the contents of the .txt files (source + target language) to create a tab-delimited .csv
• in Xbench, convert the aforementioned .csv to .tmx
• in Heartsome TMX Editor, edit the TMX custom attributes and clean up the TMX (remove duplicates)
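
For readers who would rather script this than click through several tools, the pairing, conversion and de-duplication steps above can be roughly sketched in Python. This is not the workflow actually used for the uploaded TMXs, just an illustration under a few assumptions: the source and target files are UTF-8, line N of one file is the translation of line N of the other, and the metadata is not needed. The function name `texts_to_tmx` and the header values (`creationtool="sketch"` and so on) are placeholders invented for the example.

```python
from xml.sax.saxutils import escape


def texts_to_tmx(src_path, tgt_path, out_path, src_lang="de", tgt_lang="en"):
    """Pair line N of src_path with line N of tgt_path, one TU per pair.

    Skips pairs with an empty side and exact duplicate pairs.
    Returns the number of TUs written. (Hypothetical helper.)
    """
    seen = set()   # exact (source, target) pairs already written
    written = 0
    with open(src_path, encoding="utf-8") as src_file, \
         open(tgt_path, encoding="utf-8") as tgt_file, \
         open(out_path, "w", encoding="utf-8") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<tmx version="1.4">\n')
        # Placeholder header values; adjust them to your own needs.
        out.write(
            '<header creationtool="sketch" creationtoolversion="0" '
            'segtype="sentence" o-tmf="plain-text" adminlang="en" '
            f'srclang="{src_lang}" datatype="plaintext"/>\n'
        )
        out.write("<body>\n")
        # zip() stops at the shorter file, so check beforehand that both
        # files have the same number of lines.
        for src, tgt in zip(src_file, tgt_file):
            pair = (src.strip(), tgt.strip())
            if not pair[0] or not pair[1] or pair in seen:
                continue
            seen.add(pair)
            # escape() turns &, < and > into XML entities.
            out.write(
                f'<tu><tuv xml:lang="{src_lang}"><seg>{escape(pair[0])}</seg></tuv>'
                f'<tuv xml:lang="{tgt_lang}"><seg>{escape(pair[1])}</seg></tuv></tu>\n'
            )
            written += 1
        out.write("</body>\n</tmx>\n")
    return written
```

The duplicate check keeps every pair in memory, which should be workable for batches of about a million TUs but not for a whole multi-gigabyte file at once, so it fits best after the files have been split into chunks.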

Michael

PS: Not sure what's going on with the Opus corpora site (http://opus.lingfil.uu.se/ ).
PPS: Original files here: http://www.cl.uni-heidelberg.de/statnlpgroup/pattr/

[Edited at 2015-02-24 14:05 GMT]

[Edited at 2015-02-24 14:06 GMT]


 
Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 23:46
English to German
+ ...
TOPIC STARTER
You're my hero of the year Feb 24, 2015

Dear Michael,

now I realize how complicated this alignment must be: six steps to get a usable TMX file out of it. I could never have figured that out. Thanks so much. I'll suck the claims part into my TM and enjoy.


Strange that nobody tried to align this really good stuff before.


Kindest regards

Noe


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:46
Member (2009)
Dutch to English
+ ...
You're welcome! Feb 25, 2015

Yes, I keep wondering whether someone else might already have done it, and whether it might already be available somewhere else…

Michael


 
Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 23:46
English to German
+ ...
TOPIC STARTER
1st part digested Feb 25, 2015

Dear Michael,

it took half a day to import the first part of Claims into memoQ. That is much longer than it took for Istvan's EU TMs. It's really a lot.

Thanks once again, I'll test it next week with a patent translation.

KR

Noe


 
Michael Beijer
Michael Beijer  Identity Verified
United Kingdom
Local time: 22:46
Member (2009)
Dutch to English
+ ...
Hi Noe, Feb 26, 2015

I think the only way to really search amounts of data of this size is to use something like TMLookup: http://www.farkastranslations.com/tmlookup.php

You can easily import all of these TMXs (or .txt files) (and a lot more) into a TMLookup database and then search it all as fast as lightning. It generally works a lot faster than any CAT tool I have ever tried (I've tried CafeTran, SDL Studio, memoQ, Felix, DVX2, Wordfast, Fluency and a few others).

I finished the entire CLAIMS batch (9 TMXs in total), and am currently uploading them all. I'll post links when they are ready!

Michael


 
Noe Tessmann
Noe Tessmann  Identity Verified
Local time: 23:46
English to German
+ ...
TOPIC STARTER
Really incredible Feb 26, 2015

Dear Michael,

incredible that you really managed to convert the whole corpus. Really amazing.
I already use the lookup tool for Andras' EU TMs via IntelliWebSearch. You're right, nothing is faster than this tool.

Can't wait to download the stuff

Kind regards

Noe


 
2nl (X)
2nl (X)  Identity Verified
Netherlands
Local time: 23:46
UltraEdit handles large files Feb 27, 2015

UltraEdit for Mac handles large files. A must-have editor for OS X, even if you have TextWrangler.

 