Segments created at double-spaces
Thread poster: Nicolas Gambardella

Nicolas Gambardella
United Kingdom
Local time: 02:13
Member (2019)
English to French
+ ...
Feb 27

Hello,

I am handling large PowerPoint documents created by OCR. The OCR inserted double spaces between each word. As a result, CTE creates one segment per word. Sorry for the probably elementary and naive question, but how can I configure CTE so that it does not segment at double spaces and keeps the sentences complete?


 

Igor Kmitowski  Identity Verified
Poland
Local time: 03:13
Member (2016)
English to Polish
+ ...
Double spaces Feb 27

Hello Nicolas,

In the Dashboard, click the menu icon and select Preferences. Then, change segmentation from the default Sentence to Rules.srx and create a new project.

Alternatively, if possible, you could search for double spaces and replace them with single spaces in PowerPoint before creating a new project with this document in CafeTran.


 

Nicolas Gambardella
United Kingdom
Local time: 02:13
Member (2019)
English to French
+ ...
THREAD POSTER
Thanks for the suggestions Feb 27

Thanks Igor.

I have to learn the language to build rules before using them. I tested the CTE rule editor and am none the wiser. I will definitely learn how to do so, but not today, since the project has ultra-tight deadlines.

I hoped to avoid fiddling with the PPTX beforehand because LibreOffice screws up the layout quite a bit. And I actually have no idea what the final document will look like in MS PowerPoint under Windows. But find/replace seems the quickest way for today. I just checked and indeed, CTE segmentation is much better.


 

Jean Dimitriadis  Identity Verified
English to French
+ ...
Segmentation Feb 27

Regarding segmentation, since CafeTran supports the Segmentation Rules eXchange (SRX) format, you can also add other SRX files, such as the one OmegaT uses, the advantage being that it is already populated with various language-specific rules, so you don't have to tweak them much, if at all.

You can find some related information and suggestions at: https://github.com/idimitriadis0/TheCafeTranFiles/wiki/1-Preferences#segmentation
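
For readers new to SRX, the general idea can be sketched in a few lines of Python. This is only an illustration of how ordered break and no-break rules behave, under the assumption that the first rule matching at a position decides the outcome. It is not CafeTran's or OmegaT's engine, and the `Rule` type, the `segment` function and the two patterns are made up for this example rather than taken from any shipped SRX file.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    is_break: bool  # True = break here, False = exception (do not break)
    before: str     # regex that must match the text ending at the position
    after: str      # regex that must match the text starting at the position


# Illustrative rules only: exceptions are listed before the general break rule.
RULES = [
    Rule(False, r"\b(?:Mr|Dr|etc)\.", r"\s"),  # no break after these abbreviations
    Rule(True, r"[.?!]", r"\s+"),              # break after sentence-final punctuation
]


def segment(text: str, rules: List[Rule]) -> List[str]:
    """Split text at positions where the first matching rule is a break rule."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for rule in rules:
            before_ok = re.search(rf"(?:{rule.before})\Z", text[:pos])
            after_ok = re.match(rule.after, text[pos:])
            if before_ok and after_ok:
                if rule.is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, later rules are ignored
    segments.append(text[start:])
    return segments


print(segment("The  cat  sat.  It  purred.", RULES))
```

With rules like these, double spaces between words never trigger a break, because only sentence-final punctuation followed by whitespace does. A real tool would normally also trim the whitespace that this sketch leaves at the start of each segment.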


 

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
No splitting Feb 27

For what it's worth: when I use my SRX file no_splitting_after_colons.srx, I get this result:

[Screenshot 2020-02-27 at 11.55.28]

(You can send me a PM, if you want to test the SRX file.)

Another approach, that requires some fiddling, would be:

  • Append .zip to the file name of the presentation.
  • Open the ZIP file in a file browser.
  • Navigate to the folder that contains the slides.
  • Open the slide files in a UTF-compatible editor.
  • Replace all double spaces with single spaces.
  • Save and close all files.
  • Remove the .zip extension from the file name of the presentation.
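
The manual steps above could also be scripted. What follows is a minimal Python sketch, not something posted in the thread. It assumes the slide text sits in `<a:t>` elements of the `ppt/slides/slideN.xml` parts inside the PPTX archive, which is the usual layout for PowerPoint files. The function names `collapse_spaces_in_slide` and `clean_pptx` are hypothetical. The replacement is limited to text runs so that whitespace elsewhere in the XML is left alone.

```python
import re
import zipfile

# Matches a DrawingML text run: <a:t>...</a:t> (but not <a:tbl>, <a:tab/> etc.).
TEXT_RUN = re.compile(r"(<a:t(?:\s[^>]*)?>)(.*?)(</a:t>)", re.DOTALL)


def collapse_spaces_in_slide(xml: str) -> str:
    """Replace each double space with a single space, inside <a:t> only."""
    def fix(match: re.Match) -> str:
        opening, text, closing = match.groups()
        return opening + text.replace("  ", " ") + closing
    return TEXT_RUN.sub(fix, xml)


def clean_pptx(src_path, dst_path) -> None:
    """Copy the archive to dst_path, rewriting only the slide XML parts."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Assumption: slides are stored as ppt/slides/slide1.xml, slide2.xml, ...
            if re.fullmatch(r"ppt/slides/slide\d+\.xml", item.filename):
                xml = data.decode("utf-8")
                data = collapse_spaces_in_slide(xml).encode("utf-8")
            dst.writestr(item, data)
```

A call such as `clean_pptx("scan.pptx", "scan_clean.pptx")` (hypothetical file names) writes a new file and leaves the original untouched. As with a single find/replace pass, a run of three spaces ends up as two, and a double space split across two separate text runs is not caught.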


[Screenshot 2020-02-27 at 11.59.51]


 

Nicolas Gambardella
United Kingdom
Local time: 02:13
Member (2019)
English to French
+ ...
THREAD POSTER
It's a winner Feb 27

Thanks Hans,

No need to rename the file since unzip knows (under Linux) that pptx are zip files (well, it should always know that but anyway...)

I did replace all the double spaces that way, and the result is perfect as expected. No messing of the text layout as with LO, and the file size is the same (of course) instead of double when saved with LO.

I would, of course, prefer to use the SRX path in the future.


 

Hans Lenting  Identity Verified
Netherlands
Member (2006)
German to Dutch
New feature? Feb 27

Nicolas Gambardella wrote:

I did replace all the double spaces that way, and the result is perfect as expected. No messing of the text layout as with LO, and the file size is the same (of course) instead of double when saved with LO.

I would, of course, prefer to use the SRX path in the future.


Good to hear!

On several occasions I've thought: it would be nice if the filter tab of the Project Configuration wizard offered a way to replace double spaces (and only double ones, not triple ones, etc.) with single spaces.
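
The "only double ones" condition can be expressed with a regular expression. The Python sketch below is an illustration of that idea only, not an existing CafeTran option, and the names `EXACTLY_TWO_SPACES` and `collapse_double_spaces` are made up here. It matches two spaces that are neither preceded nor followed by another space, so longer runs are left alone.

```python
import re

# Exactly two spaces: not preceded by a space and not followed by one.
EXACTLY_TWO_SPACES = re.compile(r"(?<! ) {2}(?! )")


def collapse_double_spaces(text: str) -> str:
    """Replace runs of exactly two spaces with one; leave longer runs untouched."""
    return EXACTLY_TWO_SPACES.sub(" ", text)


print(collapse_double_spaces("one  two   three"))
```

Here the double space after "one" is collapsed, while the triple space before "three" is kept as it is.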


 

