PICList Thread
'[EE] Language suggestion for quickly parsing a text file'
2007\09\04@190412 by Vitaliy

Hi All,

I have some experience in C and Object Pascal, but I get the feeling that
they're not the best languages for parsing text. A web developer who once
worked here commented that a C program we were using for parsing a CSV file
(grab fields, rearrange them, do basic sanity checks) would take a fraction
of the effort to code in PHP.

What other "text-friendly" languages are out there, besides PHP and Python?
Which one do you prefer to use for simple parsing?

Vitaliy

2007\09\04@192112 by Alex Harford

On 9/4/07, Vitaliy <spam_OUTspamTakeThisOuTspammaksimov.org> wrote:
>
> What other "text-friendly" languages are out there, besides PHP and Python?
> Which one do you prefer to use for simple parsing?

Ruby, perl, lua to name a few.

Python is my favourite though. :)

2007\09\04@192246 by Herbert Graf

On Tue, 2007-09-04 at 16:03 -0700, Vitaliy wrote:
> Hi All,
>
> I have some experience in C and Object Pascal, but I get the feeling that
> they're not the best languages for parsing text. A web developer who once
> worked here commented that a C program we were using for parsing a CSV file
> (grab fields, rearrange them, do basic sanity checks) would take a fraction
> of the effort to code in PHP.
>
> What other "text-friendly" languages are out there, besides PHP and Python?
> Which one do you prefer to use for simple parsing?

Perl. Some of the text stuff Perl's got is so powerful it boggles the
mind. Unfortunately it's not always the easiest to figure out what's
going on for a piece of code.

TTYL

2007\09\04@195931 by Marcel Birthelmer

Specifically for CSV parsing, I quite like awk. It's a very "mature"
product/language of course, but the syntax is basically C with some
perl-style declarations.
As for general-purpose text parsing, Perl is nice, but the fact that
it's pretty unmaintainable, even with good comments, makes it much
less useful in the long run. I haven't used python's regex abilities
yet, but that's probably where I would check first these days.
Rgds,
- Marcel

2007\09\04@202356 by Jake Anderson

Vitaliy wrote:
> Hi All,
>
> I have some experience in C and Object Pascal, but I get the feeling that
> they're not the best languages for parsing text. A web developer who once
> worked here commented that a C program we were using for parsing a CSV file
> (grab fields, rearrange them, do basic sanity checks) would take a fraction
> of the effort to code in PHP.
>
> What other "text-friendly" languages are out there, besides PHP and Python?
> Which one do you prefer to use for simple parsing?
>
> Vitaliy
>
>  
Depending on the quality of your text files, MySQL has a CSV storage engine,
i.e. you point MySQL at your CSV file and call it a database, then you can just
"select" the data how you want it.
My first Python program was a CSV parser, about 40 lines; it processed a
30 GB CSV file at the drive's data rate on a 3 GHz P4 (about 60-70% CPU).
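
The "load it into a database and select" idea can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module for MySQL's CSV engine, so it shows the same idea, not the MySQL feature itself. The table name, columns and sample data are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a CSV file on disk.
SAMPLE = "name,qty,price\nwidget,3,1.50\ngadget,10,0.25\n"

def load_csv(conn, text):
    """Load CSV text into an in-memory table 'items' (assumed schema)."""
    rows = list(csv.reader(io.StringIO(text)))
    data = rows[1:]  # skip the header row
    conn.execute("CREATE TABLE items (name TEXT, qty INTEGER, price REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", data)

conn = sqlite3.connect(":memory:")
load_csv(conn, SAMPLE)

# Rearranging and filtering fields is now an ordinary SELECT.
result = conn.execute(
    "SELECT price, name FROM items WHERE qty > 5 ORDER BY name"
).fetchall()
print(result)
```

Whether this beats a plain script depends on how much selecting and joining you need to do afterwards.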

2007\09\04@231925 by John Chung

I have used AWK for quite a while now. It is a language
designed for quick parsing and formatting of text
files. It was perhaps the first true programming
language built for text manipulation. Anyway, once
you learn it you will always use it for text
manipulation, since it is quick to use for text
applications. Yes, it is more readable than
Perl... Still, Perl can do a lot more than text
manipulation.

John

PS: It takes a bit of time *maybe a month* to fully
understand the capabilities of AWK.

I do recommend this book.
http://www.cs.bell-labs.com/cm/cs/awkbook/index.html



2007\09\05@023621 by Hector Martin

Python is pretty good. Perl is faster to code for if you're doing a
quick hack, since you can abbreviate a lot and do much of the dirty work
with regular expressions, but it also makes the code more unreadable. If
you just need to change a few things in a text file, use Perl or even
sed. If you need to add more logic in, I'd say use Python. It depends on
what you're comfortable with, though - I've seen a lot of complex stuff
written in Perl, but I prefer Python myself.

For text parsing, look into the built-in Python string functions and
methods (split, join, etc), and the re module for regular expressions.
Python also has modules for specific file formats, such as CSV:

http://docs.python.org/lib/module-csv.html
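
As a rough illustration of the original task (grab fields, rearrange them, do basic sanity checks), here is a minimal sketch using the csv module. The function name, the column order and the sample data are assumptions made up for the example.

```python
import csv
import io

def rearrange(text, order, expected_fields):
    """Yield rows with columns reordered; raise on rows of the wrong width."""
    for lineno, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if len(row) != expected_fields:
            raise ValueError(
                f"line {lineno}: expected {expected_fields} fields, got {len(row)}"
            )
        yield [row[i] for i in order]

# Hypothetical input; a real script would read from an open file instead.
sample = 'id,name,city\n1,"Smith, J.",Boston\n'

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerows(rearrange(sample, [2, 0, 1], 3))
print(out.getvalue())
```

The writer re-quotes fields that contain a comma, so the output stays valid CSV after the columns are moved.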

--
Hector Martin (hectorspamKILLspammarcansoft.com)
Public Key: http://www.marcansoft.com/marcan.asc

2007\09\05@033335 by William \Chops\ Westfield


>> What other "text-friendly" languages are out there

Well, you can still get TECO.  (And whatever happened to
SNOBOL?)

It's not a completely off the wall idea; for pure text
processing, you could do worse than an mlisp program run
in emacs, or some other sort of macro in some other editor.

BillW

2007\09\05@075055 by Rolf

Vitaliy wrote:
> Hi All,
>
> I have some experience in C and Object Pascal, but I get the feeling that
> they're not the best languages for parsing text. A web developer who once
> worked here commented that a C program we were using for parsing a CSV file
> (grab fields, rearrange them, do basic sanity checks) would take a fraction
> of the effort to code in PHP.
>
> What other "text-friendly" languages are out there, besides PHP and Python?
> Which one do you prefer to use for simple parsing?
>
> Vitaliy
>
>  
Hi Vitaliy

I have a *lot* of experience in dealing with CSV files. I work in the
finance (banking, insurance) world, and CSV is a significant data
transfer mechanism. My caution to you is to understand that CSV does not
just involve commas....

Perl is considered to be a fantastic language for text processing, but,
for the most part, that is because it has such a powerful, and easy to
access RegExp library. Many banks use perl to process CSV. Things like
"@parts = split /,/, $line" are commonplace, but, are also broken! CSV
has a complicated quoting mechanism that is not possible to process in a
single regular expression (I believe). Remember, the following line is
valid CSV:

Matthew,"Simon, also known as ""Peter""","John, ""The Baptist""","Some values are quoted even though they have no comma",some are not.

and it resolves to the values:

Matthew
Simon, also known as "Peter"
John, "The Baptist"
Some values are quoted even though they have no comma
some are not.

Because of the complicated quoting mechanism, it gets 'fun'.
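
To make the difference concrete, the sketch below runs the example line through a naive comma split and through Python's csv module. Python is used here only for illustration; the same comparison could be made with Perl's split and Text::CSV.

```python
import csv

line = ('Matthew,"Simon, also known as ""Peter""","John, ""The Baptist""",'
        '"Some values are quoted even though they have no comma",some are not.')

naive = line.split(",")             # cuts quoted fields apart at inner commas
proper = next(csv.reader([line]))   # honours quotes and doubled quotes

print(len(naive), "pieces from the naive split")
print(len(proper), "fields from the csv module:")
for field in proper:
    print("  ", field)
```

The naive split returns more pieces than there are fields, and the pieces still carry stray quote characters.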

In Perl, use the Text::CSV module to process CSV data, it's much simpler.

Likewise, in whatever language you choose, find some library, module, or
plugin that does the grunt work for you.

We do most of our work in Java now, and we have a home-built CSV library
that we use for all our stuff now. It is much simpler. Unfortunately,
it's proprietary.

Regardless, I caution you to consider the full gamut of "CSV" before
you start down a process in one language only to find that the simple
cases are easy, but the outside cases can take a re-write to implement.

If your data is guaranteed to be simple, then any language with a regexp
library will be suitable, but if you are going to have slightly more
complexity, you should ensure that your favourite language has a
convenient (and usable) library, or be prepared to write some handling
for the complex stuff.

Rolf

2007\09\05@095111 by John Ferrell

IMHO, the language used to handle CSV files is not all that important.
What is important is having a library of CSV tools. Most Pascal and C
compilers have at least a start in this direction but normally refer to them
as "String" operations.

Far down my list of "things to do" is refresh my memory on the subject in
Pascal but having access to Excel has made me lazy when dealing with CSV
files...

John Ferrell    W8CCW
"Life is easier if you learn to plow
      around the stumps"
http://DixieNC.US

----- Original Message -----
From: "Vitaliy" <.....spamKILLspamspam.....maksimov.org>
To: "piclist" <EraseMEPICLISTspam_OUTspamTakeThisOuTmit.edu>
Sent: Tuesday, September 04, 2007 7:03 PM
Subject: [EE] Language suggestion for quickly parsing a text file


> Hi All,
>
> I have some experience in C and Object Pascal, but I get the feeling that
> they're not the best languages for parsing text.


2007\09\05@095315 by wouter van ooijen

> What other "text-friendly" languages are out there, besides
> PHP and Python?
> Which one do you prefer to use for simple parsing?

For very simple work (select lines 5-20, or all lines between the lines
'hello' and 'world') AWK is nice. For more demanding work you could try
the old man SNOBOL, but it lacks any program structure. I loved the
built-in language for the EVE editor (on VAX/VMS), but I don't know of any
ports to XP or Linux (Jan-Erik, you know any?). For the rest it is not
the language but the available libraries that determine the ease of
processing text. A good regular expression library will ease your work a
lot, and the tricks learned this way are language independent. I prefer
(of course) Python.
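
For comparison, the two "very simple" selections mentioned above might look like this in Python. This is only a sketch: the function names are made up, and the 'hello'/'world' markers are assumed to be whole lines matched exactly and included in the result.

```python
from itertools import islice

def lines_5_to_20(lines):
    """Return lines 5 through 20 (1-based, inclusive)."""
    return list(islice(lines, 4, 20))

def between(lines, start="hello", end="world"):
    """Return the lines from the `start` line through the `end` line."""
    out, inside = [], False
    for line in lines:
        text = line.rstrip("\n")
        if not inside and text == start:
            inside = True
        if inside:
            out.append(line)
            if text == end:
                inside = False
    return out

# Hypothetical input standing in for the lines of a file.
numbered = [f"line{i}\n" for i in range(1, 31)]
print(lines_5_to_20(numbered)[0], end="")
print(between(["a\n", "hello\n", "b\n", "world\n", "c\n"]))
```

Both functions accept any iterable of lines, so an open file object can be passed directly.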

Wouter van Ooijen

-- -------------------------------------------
Van Ooijen Technische Informatica: http://www.voti.nl
consultancy, development, PICmicro products
docent Hogeschool van Utrecht: http://www.voti.nl/hvu



2007\09\05@102523 by Peter P.

Vitaliy <spam <at> maksimov.org> writes
> What other "text-friendly" languages are out there, besides PHP and Python?
> Which one do you prefer to use for simple parsing?

For simple formats such as CSV even a shell is powerful enough. The question is,
what do you need to do with the data after you parse it. If it is anything other
than fixed-record-count per line, audited data then please consider using a more
evolved language than awk or shell or even php.

Perl, TCL and Python are the usual suspects here. Of these three Python has the
steepest learning curve, followed by Perl. TCL is a simple shell language that
is actually a form of LISP that was well hidden with syntactic sugar. It is also
GUI capable and cross platform, and programs written in it tend to be terse and
readable. Of course I am biased here. As an example, here is a CSV parser
in Tcl that loads a file into a list-of-lists structure, and can
cope with such problems as missing columns, misplaced separators etc. (you get to
fix that later).

--start tcl code--

set FN "thisfile.csv"

# read the file
set fd [open $FN "r"]; set rawdata [read $fd]; close $fd

# parse the file and create output in $parsed
set parsed {}
foreach lin [split $rawdata "\n"] { lappend parsed [split $lin ","] }

--end tcl code--
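
For readers who do not know Tcl, a rough Python rendering of the same list-of-lists load follows, with one small check added to flag rows whose field count differs from the first row. Like the Tcl version it splits on every comma and does not understand quoted fields. Unlike the Tcl version it drops empty lines. The function names are made up for the example.

```python
def load_rows(text, sep=","):
    """Split text into a list of lists, one inner list per non-empty line."""
    return [line.split(sep) for line in text.splitlines() if line]

def odd_rows(rows):
    """Return (row_number, row) pairs whose width differs from the first row."""
    if not rows:
        return []
    width = len(rows[0])
    return [(n, row) for n, row in enumerate(rows, start=1) if len(row) != width]

# Hypothetical data; a real script would use open("thisfile.csv").read().
rows = load_rows("a,b,c\n1,2,3\n4,5\n")
print(rows)
print(odd_rows(rows))
```
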

Peter P.


2007\09\05@110911 by Mark Rages

On 9/5/07, Peter P. <plpeter2006spamspam_OUTyahoo.com> wrote:
> Vitaliy <spam <at> maksimov.org> writes
> > What other "text-friendly" languages are out there, besides PHP and Python?
> > Which one do you prefer to use for simple parsing?
>
> For simple formats such as CSV even a shell is powerful enough. The question is,
> what do you need to do with the data after you parse it. If it is anything other
> than fixed-record-count per line, audited data then please consider using a more
> evolved language than awk or shell or even php.
>

CSV is not so simple. Quoting from http://www.faqs.org/docs/artu/ch05s02.html:

" In fact, the Microsoft version of CSV is a textbook example of how
not to design a textual file format. Its problems begin with the case
in which the separator character (in this case, a comma) is found
inside a field. The Unix way would be to simply escape the separator
with a backslash, and have a double escape represent a literal
backslash. This design gives us a single special case (the escape
character) to check for when parsing the file, and only a single
action when the escape is found (treat the following character as a
literal). The latter conveniently not only handles the separator
character, but gives us a way to handle the escape character and
newlines for free. CSV, on the other hand, encloses the entire field
in double quotes if it contains the separator. If the field contains
double quotes, it must also be enclosed in double quotes, and the
individual double quotes in the field must themselves be repeated
twice to indicate that they don't end the field.

" The bad results of proliferating special cases are twofold. First,
the complexity of the parser (and its vulnerability to bugs) is
increased. Second, because the format rules are complex and
underspecified, different implementations diverge in their handling of
edge cases. Sometimes continuation lines are supported, by starting
the last field of the line with an unterminated double quote — but
only in some products! Microsoft has incompatible versions of CSV
files between its own applications, and in some cases between
different versions of the same application (Excel being the obvious
example here)."
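
The backslash-escape design described in the quoted passage can be sketched in a few lines. This is an illustration of that scheme only, not of any existing tool; the function name is made up, and it works on a string that is already in memory.

```python
def split_escaped(line, sep=",", esc="\\"):
    """Split on `sep`, treating the character after `esc` as a literal."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == esc:
            # The single special case: take the next character literally.
            current.append(next(chars, ""))
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

print(split_escaped(r"a\,b,c\\d,e"))
```

There is one branch for the escape character and one for the separator, which is the "single special case" the passage argues for.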
--
Mark Rages, Engineer
Midwest Telecine LLC
@spam@markragesKILLspamspammidwesttelecine.com

2007\09\05@111931 by Tony Smith

> IMHO, the language used to handle CSV files is not all that important.
> What is important is having a library of CSV tools.  Most  
> Pascal & c compilers have at least a start in this direction
> but normally refer to them as "String" operations.
>
> Far down my list of "things to do" is refresh my memory on
> the subject in Pascal but having access to Excel has made me
> lazy when dealing with CSV files...


Excel gets it wrong too, but that's more a case of people not generating CSV
files properly in the first place.  As Rolf pointed out, it's not that
simple.  His example may have been a single item, which would have needed
another set of quotes to encompass it all.

I usually do TAB delimited  :)  You may have to Google 'recursive descent
parser' for some ideas, the general idea is you keep shoving stuff onto a
stack until you find the 'end', then you work back up.  (It's been years
since I've done this.)

It gets tricky with something like "x","","""x""" (really x,,"x").  Hitting
a single quote means you need to find a closing quote, and then a comma.
Hitting a double quote means either a blank string (yeah, people do that),
or it's an escaped single quote, the 3rd item is literally "x".  The
following chars set the context.  Toss it on the stack, check the next
char(s), then decide.

Here's something to ponder: """",",","""","," - what's that?
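
The quote handling described above can also be written as a small character-by-character state machine instead of a stack-based parser. The following is a minimal sketch under that substitution. It handles one line at a time, does not support newlines inside quoted fields, and the function name is made up.

```python
def parse_csv_line(line, sep=",", quote='"'):
    """Parse one CSV line using quote-doubling rules."""
    fields, current = [], []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(line) and line[i + 1] == quote:
                    current.append(quote)   # doubled quote -> literal quote
                    i += 1                  # skip the second quote
                else:
                    in_quotes = False       # closing quote
            else:
                current.append(ch)          # separators are literal in quotes
        elif ch == quote:
            in_quotes = True                # opening quote
        elif ch == sep:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields

print(parse_csv_line('"x","","""x"""'))
```

Only one character of lookahead is needed: inside a quoted field, a quote followed by another quote is a literal, and anything else closes the field.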

People can't even get apostrophes right in SQL.  (Have a look at some lyrics
sites, they just drop them!)

Tony

2007\09\06@184219 by Peter P.

Mark Rages <markrages <at> gmail.com> writes:
> CSV is not so simple. Quoting from http://www.faqs.org/docs/artu/ch05s02.html:

The OP said that he wanted to parse a CSV file, not a m$ file. For m$ files
OpenOffice has an import handler that does the right thing. Once the file is in
a table it can be processed at will and re-exported in many formats (including
CSV).

The day when I'll have to wrap my single neuron around the commercial 'quality'
things that company makes I'll start running Loose again.

Peter P.

