View Full Version : Text Tools Development



N1LAF
01-21-2012, 06:44 PM
Ever tried to copy text from a PDF to a word processor, only to manually delete line feeds so that the copied text would fit the page width? I have developed a tool to work with the text clipboard in Windows. Select the processes to be performed, then copy text from one application, and click [ GO ] on my tool. Paste to destination.

Processes I have programmed in:
- Remove all carriage returns
- Remove single line feed occurrences
- Replace single line feed occurrences with a space
- Convert to HTML
- Convert to HTML page
- Convert HTML to text
- Remove leading white space
- Remove trailing white space
- Remove both leading and trailing white space
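
For anyone curious how a few of the processes listed above might look in code, here is a minimal Python sketch. It is not the tool described in this post: the function names are hypothetical, it works on plain strings rather than the Windows clipboard, and it assumes a "single" line feed means one that is not next to another line feed, so blank-line paragraph breaks survive.

```python
import html
import re


def remove_carriage_returns(text: str) -> str:
    """Drop every CR so that Windows CRLF line endings become plain LF."""
    return text.replace("\r", "")


def remove_single_line_feeds(text: str, replacement: str = "") -> str:
    """Replace each lone LF with `replacement`.

    Runs of two or more LFs (paragraph breaks) are left alone.
    """
    return re.sub(r"(?<!\n)\n(?!\n)", replacement, text)


def strip_lines(text: str, leading: bool = True, trailing: bool = True) -> str:
    """Strip leading and/or trailing spaces and tabs from every line."""
    def strip_one(line: str) -> str:
        if leading:
            line = line.lstrip(" \t")
        if trailing:
            line = line.rstrip(" \t")
        return line

    return "\n".join(strip_one(line) for line in text.split("\n"))


def to_html(text: str) -> str:
    """Escape markup characters and wrap each paragraph in <p> tags."""
    paragraphs = [p for p in re.split(r"\n{2,}", text) if p.strip()]
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


# Hypothetical sample of text as it might arrive from a PDF viewer.
copied = "Text copied from a PDF often\r\nbreaks in mid-sentence.  \r\n\r\n  Second paragraph."

# Strip stray white space first, then join the wrapped lines with a space.
cleaned = strip_lines(remove_carriage_returns(copied))
unwrapped = remove_single_line_feeds(cleaned, " ")

print(unwrapped)
print(to_html(unwrapped))
```

The order matters in this sketch: stripping trailing white space before joining lines avoids doubled spaces where a wrapped line ended in a blank.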


What is missing?

Any interest in a tool like this?

W4GPL
01-21-2012, 09:17 PM
Yes, and I have all of them in Linux. :P Stop trying to reinvent the wheel.

KC2UGV
01-21-2012, 09:21 PM
http://en.wikipedia.org/wiki/Pdftotext

N1LAF
01-21-2012, 10:02 PM
It doesn't have to be a PDF file; it can be any source. Suppose you use Microsoft Office Document Imaging to OCR text from a graphic as the source.

KC2UGV
01-21-2012, 10:14 PM
http://man.cat-v.org/plan_9/1/doc2txt

There are lots of text transformation tools out there, open sourced.

N1LAF
01-21-2012, 10:29 PM
Yup, and you didn't create any of them

KC2UGV
01-21-2012, 10:41 PM
Why would I? They are about 20 years old now, and have been polished up quite nicely. No need to reinvent the wheel :)

n2ize
01-24-2012, 11:34 AM
Where are these commands located? I tried

$ locate doc2txt
$ which doc2txt

and I came up empty.

Then I tried

$ yum search doc2txt
$ yum provides doc2txt

and again I got bupkis.

Are these part of a package under a different name? Pray tell?

KC2UGV
01-24-2012, 11:39 AM
I'm trying to find them now and coming up empty as well. I know there was a binary with a name like that on my system at one point. But in Linux there's more than one way to skin a cat:


abiword --to=txt myfile.doc
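
If you want to script that conversion, a rough Python sketch follows. The helper names are hypothetical, and it assumes abiword writes its output next to the input file with a .txt suffix; check that on your own system before relying on it.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def abiword_command(source: str) -> List[str]:
    """Build the argument list for the abiword call shown above."""
    return ["abiword", "--to=txt", source]


def convert_to_text(source: str) -> Path:
    """Run abiword if it is on PATH and return the expected output path."""
    # shutil.which does the same PATH lookup as the shell's `which`.
    if shutil.which("abiword") is None:
        raise FileNotFoundError("abiword is not on PATH")
    subprocess.run(abiword_command(source), check=True)
    # Assumption: the result lands beside the input as <name>.txt.
    return Path(source).with_suffix(".txt")


print(abiword_command("myfile.doc"))
```

Calling convert_to_text("myfile.doc") would then run the conversion, or raise FileNotFoundError when abiword is not installed.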

KB3LAZ
01-24-2012, 02:36 PM
Never had an issue when copying from PDF to text. Nice that you like to tinker with things though. As for there being other programs out there, keep working at yours. Maybe you can make the best version. :)

WØTKX
01-24-2012, 03:57 PM
We could always discuss the merits of this... :roll:




gsave
1 0.5 scale
70 100 48 0 360 arc
fill
grestore
/Helvetica-Bold 14 selectfont
1.0 setgray
29 45 moveto
(Hello, world!) show
showpage

KC2UGV
01-24-2012, 04:01 PM
I'm more a docbook type person...



<article xmlns='http://docbook.org/ns/docbook'>
<title>Example example</title>
<example xml:id="ex.dssslfunction">
<title>A DSSSL Function</title>
<programlisting>
(define (node-list-filter-by-gi nodelist gilist)
  ;; Returns the node-list that contains every element of the original
  ;; nodelist whose gi is in gilist
  (let loop ((result (empty-node-list)) (nl nodelist))
    (if (node-list-empty? nl)
        result
        (if (member (gi (node-list-first nl)) gilist)
            (loop (node-list result (node-list-first nl))
                  (node-list-rest nl))
            (loop result (node-list-rest nl))))))</programlisting>
</example>
</article>

n6hcm
01-25-2012, 04:33 AM
the web page shown is for plan9, not linux.

for linux go here: http://doc2txt.com/

apparently it's something you have to pay for.

yum will only find something if you've told it about the repository in which it lives.