PICList Thread
'[PIC] Regular Expressions'
2009\05\27@183404 by Harold Hallikainen

After the recent discussion of regular expressions, I, of course, today
ran in to an application for them, but I have to do this in C30 for a
PIC24. I need to change a relative URL (given the base URL) to an absolute
URL. Everything I'm finding is pretty much using regular expressions in
perl or PHP. Anyone have any other ideas?

I'll search on...

Thanks!

Harold



--
FCC Rules Updated Daily at http://www.hallikainen.com - Advertising
opportunities available!

2009\05\27@185759 by solarwind

On Wed, May 27, 2009 at 6:43 PM, Harold Hallikainen wrote:
> After the recent discussion of regular expressions, I, of course, today
> ran in to an application for them, but I have to do this in C30 for a
> PIC24. I need to change a relative URL (given the base URL) to an absolute
> URL. Everything I'm finding is pretty much using regular expressions in
> perl or PHP. Anyone have any other ideas?

Sure. Regex.

You already have your solution, why search on? What's wrong with
regex? You can, of course, parse the string the good ol' way by hand,
but that's what regex was invented for.

2009\05\27@185938 by Tamas Rudnai

Is plain string manipulation not applicable in your case?

A very simplistic way:
1. Check if the URL starts with "http://" or with the given base URL
2. If not, copy the base URL and then the relative URL with a '/' separator
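
A minimal C sketch of these two steps, assuming plain http URLs and that "already absolute" means the string starts with "http://" or with the base URL itself. The function name make_absolute and its buffer convention are hypothetical, not something from the thread, and the sketch deliberately ignores leading slashes and "../" segments.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not from the thread. Writes the result to 'out'.
 * Returns 0 on success, -1 if 'out' (outsz bytes) is too small.
 * Deliberately naive: a leading '/' or "../" gets no special treatment. */
int make_absolute(const char *base, const char *rel, char *out, size_t outsz)
{
    size_t blen = strlen(base);
    size_t rlen = strlen(rel);
    size_t sep;

    /* Step 1: already absolute (assumed: "http://" or the base itself)? */
    if (strncmp(rel, "http://", 7) == 0 ||
        (blen > 0 && strncmp(rel, base, blen) == 0)) {
        if (rlen + 1 > outsz)
            return -1;
        memcpy(out, rel, rlen + 1);
        return 0;
    }

    /* Step 2: base + '/' + relative, without doubling the separator. */
    sep = (blen == 0 || base[blen - 1] != '/') ? 1 : 0;
    if (blen + sep + rlen + 1 > outsz)
        return -1;
    memcpy(out, base, blen);
    if (sep)
        out[blen++] = '/';
    memcpy(out + blen, rel, rlen + 1);
    return 0;
}
```

The caller supplies the output buffer, so no heap allocation is needed, which tends to suit a small PIC24 build.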

But things can get much more complicated than this. A relative URL can
start with a slash to indicate that it is relative to the wwwroot, or
contain "../" segments that step up from the current directory, and so on.
You may need to handle these too, but I do not think there is anything here
that cannot be handled by normal string manipulation.

You could also try porting a regex engine and then adapting the Perl
examples you found. However, it may need many more resources than would
easily fit into your chip -- I have never tried this, so someone else may
have better thoughts on it.

Tamas



2009\05\27@193423 by Rolf

Harold Hallikainen wrote:
> After the recent discussion of regular expressions, I, of course, today
> ran in to an application for them, but I have to do this in C30 for a
> PIC24. I need to change a relative URL (given the base URL) to an absolute
> URL. Everything I'm finding is pretty much using regular expressions in
> perl or PHP. Anyone have any other ideas?
>
> I'll search on...
>
> Thanks!
>
> Harold
>
>
>
>  
RegEx libraries are not the sort of thing you port just to use in one
place... the libraries are huge. The Java 'Pattern' class is in excess
of 5.5K lines, and uses many other classes... and it needs the Matcher
and other classes to get things to work. I would be somewhat surprised if
a mostly complete regex engine could fit in any PIC... and the effort
involved in harvesting a subset of regex functionality would be daunting....

I am pretty sure the effort of porting regex to C30 will be hard to
justify.

There are other ways to do it... and, in this case, I am sure they are
simpler.

What is the actual requirement... your description is somewhat vague...
but I am certain a relatively concise function will suffice given the
relative structure of URL's.

Rolf

2009\05\27@205607 by solarwind

http://tiny-rex.sourceforge.net/

2009\05\27@213645 by Dave Tweed

Rolf wrote:
> Harold Hallikainen wrote:
> > After the recent discussion of regular expressions, I, of course, today
> > ran in to an application for them, but I have to do this in C30 for a
> > PIC24. I need to change a relative URL (given the base URL) to an absolute
> > URL. Everything I'm finding is pretty much using regular expressions in
> > perl or PHP. Anyone have any other ideas?
>
> RegEx libraries are not the sort of thing you port just for one place to
> use it.... the libraries are huge. The Java 'Pattern' class is in excess
> of 5.5K lines, and uses many other classes... and it needs the matcher,
> and other classes to get things to work. I would be somewhat surprised if
> a mostly complete regex engine could fit in any PIC... and the effort
> involved in harvesting a subset of regex functionality would be daunting....

Wow, talk about code bloat! Although I'm sure it's wonderful....

As a counterexample, I have the full source for a UNIX-style 'grep' command
(dating from 1985) that clocks in at just over 400 lines of ordinary C code
(9KB), including command-line handling, RE compiler, RE evaluator and
results formatting. The x86 executable (for DOS) is 11KB. I'm sure this
would port quite readily to C30.

Of course, this is just the pattern matcher, and doesn't include any text
replacement functionality. But if you make the right enhancements to the
evaluation engine to capture the beginning and ending points of matches,
adding replacement should be fairly straightforward.
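
To give a feel for how small a matcher that reports match boundaries can be, here is a sketch in C. It is not the 1985 grep source mentioned above (which is not shown in this thread); it follows the well-known compact matcher by Rob Pike from "The Practice of Programming" and supports only ^, $, . and *. The name find_match and the start/end out-parameters are assumptions added for illustration.

```c
#include <stddef.h>

static const char *match_here(const char *re, const char *text);

/* Matches c* followed by the rest of the pattern, shortest run of c first. */
static const char *match_star(int c, const char *re, const char *text)
{
    do {
        const char *end = match_here(re, text);
        if (end != NULL)
            return end;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return NULL;
}

/* Matches 're' at exactly 'text'; returns the end of the match or NULL. */
static const char *match_here(const char *re, const char *text)
{
    if (re[0] == '\0')
        return text;
    if (re[1] == '*')
        return match_star(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return (*text == '\0') ? text : NULL;
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1);
    return NULL;
}

/* Finds the leftmost match; on success stores its bounds and returns 1. */
int find_match(const char *re, const char *text,
               const char **start, const char **end)
{
    const char *e;

    if (re[0] == '^') {
        e = match_here(re + 1, text);
        if (e == NULL)
            return 0;
        *start = text;
        *end = e;
        return 1;
    }
    do {
        e = match_here(re, text);
        if (e != NULL) {
            *start = text;
            *end = e;
            return 1;
        }
    } while (*text++ != '\0');
    return 0;
}
```

With start and end in hand, a replacement is a matter of copying the text before start, the new text, and the text from end onward. Note that the matcher recurses once per pattern character, so stack use on a small part grows with pattern length.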

-- Dave Tweed

2009\05\27@214821 by Rolf

solarwind wrote:
> http://tiny-rex.sourceforge.net/
>  
Yes, and... have you compiled it in C30 to see how much program and data
memory are actually required to implement it? Sure, the source code is
only 17K (excluding imported libraries), but how small does it compile?
(Some PIC24Fs have less than 4K program instructions, and the largest
has less than 86K instructions.)

Would it be easier/faster/more reliable/smaller/etc. to write a one-off
custom function to do a very particular string manipulation?

Regexes have their place, but carrying a vast library in an embedded
system to perform a single trivial task strikes me as counter-intuitive.

Rolf

2009\05\27@220102 by Rolf


Dave.

Of course you are right... (for the record, the Java code deals with
multi-byte characters, Unicode, etc., which significantly compounds the
problem, as well as all the grouping, substitution, positive and
negative look-aheads, look-behinds, etc.)

My initial reaction was excessive.

With a PIC24 I can see there being space for a regex engine... but I
imagine that any PIC24 using URLs will also need the TCP/IP libraries.

Now, to pick, or choose ;-)

Remember, the biggest PIC24s have 85K instructions available, and even
solarwind's contribution of the trex library, though small, does only
the easy part of the process (the matching, not the replacing), and that
is 17K of source code and uses external libraries.

I'll back off and see whether a regex library actually fits in a spot
small enough for a real and useful program to fit alongside it in a PIC24.

On the other hand, given that Google is very bare of references for
PICs and regex, I imagine very few people have accomplished such a
task.

Rolf


2009\05\28@002815 by Harold Hallikainen


> What is the actual requirement... your description is somewhat vague...
> but I am certain a relatively concise function will suffice given the
> relative structure of URL's.

This is part of a closed captioning system for digital cinema. I get a
Resource Presentation List from the cinema server. It's an XML file that
includes the URLs for each of the caption files (typically one for each
reel and language). These URLs are typically absolute, but may be relative
to the URL of the RPL. So, I need a function where I can pass in my new
URL, a base URL, and get back an absolute representation of the new URL.
So, I think I need to be able to handle relative URLs that include such
things as

./file
../file
../../path/path/file
/path/file

etc.

I'm then passing the absolute URL to an http client I wrote that uses the
Microchip TCP/IP stack to go get the required files.
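
For the "./" and "../" cases listed above, one option is to collapse the dot segments in place once the relative part has been joined to the base path. The following C sketch assumes the path already starts with '/' (that is, it has been merged with a base); the name remove_dot_segments is borrowed from RFC 3986 section 5.2.4, but this is a simplified in-place variant rather than a transcription of that algorithm.

```c
/* Collapses "." and ".." segments of 'path' in place.
 * Assumes 'path' starts with '/', as it does once a relative reference has
 * been merged with its base. A ".." at the root stays at the root. */
void remove_dot_segments(char *path)
{
    const char *r = path;   /* read cursor */
    char *w = path;         /* write cursor, never ahead of r */

    while (*r != '\0') {
        if (r[0] == '/' && r[1] == '.' && (r[2] == '/' || r[2] == '\0')) {
            /* "/./" or a trailing "/.": skip the "." segment */
            if (r[2] == '\0')
                *w++ = '/';
            r += 2;
        } else if (r[0] == '/' && r[1] == '.' && r[2] == '.' &&
                   (r[3] == '/' || r[3] == '\0')) {
            /* "/../" or a trailing "/..": drop the last segment written */
            while (w > path && *--w != '/')
                ;
            if (r[3] == '\0')
                *w++ = '/';
            r += 3;
        } else {
            /* copy one segment together with its leading '/' */
            do {
                *w++ = *r++;
            } while (*r != '\0' && *r != '/');
        }
    }
    *w = '\0';
}
```

Because the output is never longer than the input, no second buffer is needed. Names that merely begin with dots, such as "/.hidden" or "/..b", are copied unchanged.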

Thanks for the comments so far!

Harold




2009\05\28@073131 by Gerhard Fiedler

Harold Hallikainen wrote:

>> What is the actual requirement... your description is somewhat vague...
>> but I am certain a relatively concise function will suffice given the
>> relative structure of URL's.
>
> This is part of a closed captioning system for digital cinema. I get a
> Resource Presentation List from the cinema server. It's an XML file that
> includes the URLs for each of the caption files (typically one for each
> reel and language). These URLs are typically absolute, but may be relative
> to the URL of the RPL. So, I need a function where I can pass in my new
> URL, a base URL, and get back an absolute representation of the new URL.
> So, I think I need to be able to handle relative URLs that include such
> things as
>
> ./file
> ../file
> ../../path/path/file
> /path/file

Not really a problem, I think.

<basePath>/./file
<basePath>/../file
<basePath>/../../path/path/file
<basePath>//path/file

These should all work fine. (Note that duplication of the slash is
generally not a problem. Depends on the file system of the server, but
at least with Windows and Linux systems this works.)

That's what a relative path is: it's the path portion after the base
path. So I don't really understand what you think you may need to parse.
Just append the relative path to the base path (with a trailing slash,
e.g. "http://myserver/myTopLevelDir/") and you should be done.

How to find out whether a path is relative or absolute depends on what kind
of paths you can expect on the input. If it's either a complete http URL
(starting with "http:") or a relative path, then that's it: check for a
starting "http:".

It seems to me that a regex parser is a bit overboard for this :)

Gerhard

2009\05\28@074605 by Tamas Rudnai

On Thu, May 28, 2009 at 12:31 PM, Gerhard Fiedler wrote:

> It seems to me that a regex parser is a bit overboard for this :)
>

I could not say it better!

Tamas
--
http://www.mcuhobby.com

2009\05\28@083009 by Harold Hallikainen


OK, that's interesting. I'll give it a try. I'm using something similar to
the example in Microchip's stack. In the example, they do a Google search.
I'm, instead, passing in the URL. I'll try just appending the relative URL
and see what happens. I guess it's really up to the http server as to how
it handles the GET request. By the way, I see a lot of requests for
../../../../etc/passwd on my server logs. I have a script that blocks the
IP address of people that try to do that (along with a bunch of other
things).

Thanks!

Harold


2009\05\28@090057 by Tamas Rudnai

On Thu, May 28, 2009 at 1:39 PM, Harold Hallikainen wrote:

> By the way, I see a lot of requests for
> ../../../../etc/passwd on my server logs. I have a script that blocks the
> IP address of people that try to do that (along with a bunch of other
> things).
>

On a secure system they won't get anything useful from that file -- if
your server is configured correctly, you cannot get any files that do not
belong to the wwwroot. And of course on most modern unix/linux systems you
will have shadow passwords, so even if they could get that file they would
not be able to do an offline dictionary or brute force attack. They could
get the user names out of it and try to attack online, but then again your
system should be able to block these. And of course you should never enter
real names, phone numbers or any other valuable information in the passwd
file, so that old-style social engineering will be hard for them too.

BTW: What are you doing with the IP addresses that are coming from a
provider that gives the IPs dynamically to their users?

Tamas
--
http://www.mcuhobby.com

2009\05\28@094525 by Dave Tweed

Gerhard Fiedler wrote:
> Not really a problem, I think.
>
> <basePath>/./file
> <basePath>/../file
> <basePath>/../../path/path/file
> <basePath>//path/file
>
> These should all work fine. (Note that duplication of the slash is
> generally not a problem. Depends on the file system of the server, but
> at least with Windows and Linux systems this works.)

Actually, it doesn't (or at least it shouldn't) depend on the filesystem at
all, only on the web server software. Only an extremely primitive server
would pass URL paths directly to the filesystem without doing some serious
sanitizing of its own first -- it's a huge security hole, otherwise.

> That's what a relative path is: it's the path portion after the base
> path. So I don't really understand what you think you may need to parse.
> Just append the relative path to the base path (with a trailing slash,
> e.g. "http://myserver/myTopLevelDir/") and you should be done.
>
> To find out whether a path is relative or absolute depends on what kind
> of paths you can expect on the input. If it's either a complete http URL
> (starting with "http:") or a relative path, then that's it: check for a
> starting "http:".

The presence or absence of the optional scheme field does not determine
whether a URL path is absolute or relative -- it's the leading slash after
that (single or double) that tells you.

> It seems to me that a regex parser is a bit overboard for this :)

Yes.

What's actually needed here is a specific "URL parser" function that can
separate a URL into its components, any of which are optional:

  scheme, user, password, host, port, path, item, query, fragment

i.e.,

  scheme://user:password@host:port/path/item?query#fragment

along with an indication of whether the path is absolute or relative.
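
A rough C sketch of such a splitter follows. It uses the generic component order of RFC 3986, in which the query comes before the fragment, keeps user, password, host and port together as one authority slice, and does not split out the last path name. The names slice, url_parts, split_url and path_is_absolute are hypothetical, chosen for this illustration only.

```c
#include <stddef.h>
#include <string.h>

/* Pointer + length into the caller's string; not NUL-terminated. */
typedef struct { const char *p; size_t n; } slice;

typedef struct {
    slice scheme, authority, path, query, fragment;
    int has_scheme, has_authority;
} url_parts;

/* Splits 's' into components without copying or modifying it. */
void split_url(const char *s, url_parts *u)
{
    const char *p = s;
    size_t n;

    memset(u, 0, sizeof *u);

    /* scheme: text before ':' if no '/', '?' or '#' comes first */
    n = strcspn(p, ":/?#");
    if (n > 0 && p[n] == ':') {
        u->scheme.p = p; u->scheme.n = n; u->has_scheme = 1;
        p += n + 1;
    }
    /* authority: follows "//" and runs to the next '/', '?' or '#' */
    if (p[0] == '/' && p[1] == '/') {
        p += 2;
        n = strcspn(p, "/?#");
        u->authority.p = p; u->authority.n = n; u->has_authority = 1;
        p += n;
    }
    /* path: everything up to the query or fragment */
    n = strcspn(p, "?#");
    u->path.p = p; u->path.n = n;
    p += n;
    if (*p == '?') {
        p++;
        n = strcspn(p, "#");
        u->query.p = p; u->query.n = n;
        p += n;
    }
    if (*p == '#') {
        p++;
        u->fragment.p = p; u->fragment.n = strlen(p);
    }
}

/* The path is absolute when it begins with '/'. */
int path_is_absolute(const url_parts *u)
{
    return u->path.n > 0 && u->path.p[0] == '/';
}
```

Returning slices into the original string avoids copying, at the cost that the caller must keep the input buffer alive while the parts are in use.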

Note that some of these components are specific to the "http" scheme.
Note also that most documentation (e.g., the RFCs) does not mention what
I'm calling "item" here (the last name in the path) -- but in most cases,
it's significant enough to call it out separately.

For simplicity, in what follows, I'll just use the word "server" to refer
to the collection of user, password, host and port (RFC2396 calls this the
"authority component"), and "item" will include any fragment or query.

In the completely general case, you process three items in sequence to
figure out what the pieces of the final URL need to be:

1. the URL of the original XML file

  This gives you a default scheme, server, and path; ignore the item
  (although you may need it if the partial URL begins with "#").

2. the base URL given inside the XML file
  (in the general case, this may be optional)

  This may or may not update the scheme and/or server components.
  In any case, if it's absolute, you replace the original path component;
  otherwise, you combine the two together.

3. the "partial" URL for a new item given inside the XML file

  This gives you at a minimum the item (which might be just a fragment).
  This may or may not also update the scheme and/or server components.
  And again, if it's absolute, you replace the original path component;
  otherwise, you combine the path given here with the results of the
  previous step.

Only then can you put the components back together correctly in order to
create the full absolute URL for the new item.

I'd still let the server worry about any /../ or /./ in the final path.

-- Dave Tweed

2009\05\28@095258 by Gerhard Fiedler

Harold Hallikainen wrote:

>> <basePath>/./file
>> <basePath>/../file
>> <basePath>/../../path/path/file
>> <basePath>//path/file

> I guess it's really up to the http server as to how it handles the GET
> request.

Exactly. Normal file system rules apply.

> By the way, I see a lot of requests for ../../../../etc/passwd on my
> server logs.

A typical web server won't service any requests for files that are above
its web root directory. That's why "<basePath>/../../path/path/file"
from the examples above assumes that basePath contains at least two
directory levels after the server name.
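
A client that wants to know in advance whether the "../" segments would climb out of the base path can count directory levels as it walks the string. This is only a sketch in C; the name path_escapes_root is made up for this example, and every non-dot segment, including a final file name, counts as one level.

```c
#include <stddef.h>

/* Returns 1 if, walking 'path' segment by segment, a ".." would step
 * above the starting directory; returns 0 otherwise. */
int path_escapes_root(const char *path)
{
    int depth = 0;
    const char *seg = path;

    for (;;) {
        const char *end = seg;
        size_t len;

        while (*end != '\0' && *end != '/')
            end++;
        len = (size_t)(end - seg);

        if (len == 2 && seg[0] == '.' && seg[1] == '.') {
            if (--depth < 0)
                return 1;           /* climbed above the start */
        } else if (len > 0 && !(len == 1 && seg[0] == '.')) {
            depth++;                /* an ordinary name goes one level down */
        }
        if (*end == '\0')
            return 0;
        seg = end + 1;
    }
}
```

Called on the path portion after the server name, this returns 0 for "dir1/dir2/../../path/path/file" and 1 for "dir1/../../path/path/file", which matches the two-level requirement described above.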

Gerhard

2009\05\28@103112 by Gerhard Fiedler

Dave Tweed wrote:

>> To find out whether a path is relative or absolute depends on what
>> kind of paths you can expect on the input. If it's either a complete
>> http URL (starting with "http:") or a relative path, then that's it:
>> check for a starting "http:".
>
> The presence or absence of the optional scheme field does not
> determine whether a URL path is absolute or relative -- it's the
> leading slash after that (single or double) that tells you.

That depends on what the specific spec of the OP determines. It could
very well be that a path "/NormallyConsideredAbsolutePath/file.ext" is
considered a "relative" path, relative to a base path, say
"http://myserver/" (or even relative to
"http://myserver/MyTopLevelDir/"). It all depends on the specific
situation and how the "relative" paths are created.

This also determines to what depths he needs to go in parsing the paths.

Gerhard

2009\05\28@105350 by Harold Hallikainen

> On Thu, May 28, 2009 at 1:39 PM, Harold Hallikainen wrote:
>
>> By the way, I see a lot of requests for
>> ../../../../etc/passwd on my server logs. I have a script that blocks the
>> IP address of people that try to do that (along with a bunch of other
>> things).
>
> On a secure system they won't get anything useful from that file -- if
> your server is configured correctly, you cannot get any files that do not
> belong to the wwwroot. And of course on most modern unix/linux systems you
> will have shadow passwords, so even if they could get that file they would
> not be able to do an offline dictionary or brute force attack. They could
> get the user names out of it and try to attack online, but then again your
> system should be able to block these. And of course you should never enter
> real names, phone numbers or any other valuable information in the passwd
> file, so that old-style social engineering will be hard for them too.
>
> BTW: What are you doing with the IP addresses that are coming from a
> provider that gives the IPs dynamically to their users?
>
> Tamas


I think leases tend to be fairly long term (a month or more). I'm blocking
the IP for a month. While Apache properly prevents access to /etc/passwd
(and I am using the shadow password file), attempts to access this show an
attempt to break into the server, so I block them. I also see a fair
number of attempts to get at MSOffice (which is certainly not on my Fedora
server), so I block those also. In general, I'm blocking anything that
appears to be a break-in attempt. I block for about a month, then let them
try again.

Recent blocks are:
6:16 am        123.27.127.224  being blocked because of authentication ...
Wed, 5:10 pm   218.1.64.133    being blocked because of ../../../../...
Wed, 4:36 pm   212.34.140.136  being blocked because of Failed pas...
Wed, 4:13 pm   194.8.74.124    being blocked because of NeuerKomment...
Wed, 8:55 am   66.249.71.237   being blocked because of NeuerKommen...
Wed, 7:55 am   81.88.124.30    being blocked because of CONNECT
Tue, 10:59 pm  124.128.83.222  being blocked because of authentication ...
Tue, 9:30 am   61.142.208.164  being blocked because of Failed pas...
Sun, 11:53 pm  61.172.243.233  being blocked because of Failed pas...
Sun, 8:46 pm   65.55.109.167   being blocked because of WikiBlogPlu...
Sun, 4:51 pm   67.202.2.132    being blocked because of Failed passw...
Sun, 3:59 pm   213.92.8.21     being blocked because of authentication ...
Sun, 9:16 am   202.100.219.81  being blocked because of authentication ...
Sun, 4:26 am   61.6.65.252     being blocked because of authentication ...
Sat, 5:54 pm   63.217.29.66    being blocked because of NeuerKomment...

The NeuerKomment and WikiBlogPlugin entries are attempts at wiki spam, so
I block them. The "Failed password" entries are ssh login attempts. Before
I started running these scripts, I'd find thousands of failed logins in
reviewing the logs each morning. Now there are maybe 5 or 10.

Harold




2009\05\28@114622 by Harold Hallikainen


This is what I'm afraid of... The proposed standard just says the URL can
be absolute or relative. It does not limit how convoluted it may be.

In evaluating URLs, I'd always thought that // preceded the host, and /
preceded the path or file. In a relative URL, we should never see //. If
the URL starts with /, we're starting at the base or root of the server
directory tree. If it starts with anything else, we're starting from the
path to the file this was referenced in. I just tried accessing a URL by
adding the relative URL to the previous URL, and it seems to work. This
will make it a lot easier than my trying to make the URL absolute. My test
URL was
http://www.hallikainen.com/FccRules/2009/36/382/../../../2008/36/382/index.php

Thanks for the help!

Harold




2009\05\28@134005 by Gerhard Fiedler


Convolution towards the end doesn't have to bother you. The only thing
that matters is the beginning.

Dave seemed to imply that a relative URL may include a scheme or a
server (the "user:password@host:port" part he mentioned). I'm not sure
this is correct. If it isn't, you're still back to the simplicity of my
post a few posts back: if it starts with a scheme ("http:" in your case)
you consider it absolute, if not, consider it relative.

However, if your base URL contains a path part in addition to the server
(like "http://myserver/basepath/"), things become a bit ambiguous. What
does a "relative" URL of the type "/path1/path2/file.ext" mean in the
context of this base URL? Would this expand to
<http://myserver/basepath/path1/path2/file.ext> (accepting the base URL
as base for all paths that don't specify a server) or to
<http://myserver/path1/path2/file.ext> (accepting the idea that a
starting slash means the top level directory that is accessible on the
specified server)? I don't think that there is a standard for this, so
it has to be specified for this application.

So you need a spec of some kind... Either restricting the base URL to
URLs that don't contain a path part, or defining what it means if the
base URL contains a path and the "relative" URL starts with a slash.

> In evaluating URLs, I'd always thought that // preceded the host, ...

AFAIK it does, but it may also be part of the path.

> ... and / preceded the path or file.

Not really -- it's a separator, and if not preceded by a path part means
the root.

> In a relative URL, we should never see //.

Not at the beginning, but in the middle of the path part it's perfectly
legal.

> If the URL starts with /, we're starting at the base or root of the
> server directory tree.

Actually, if it starts with /, you're referencing the local file system.
Unless, of course, it is in the context of some base URL -- but then it
depends on that base URL.

> I just tried accessing a URL by adding the relative URL to the
> previous URL, and it seems to work. This will make it a lot easier
> than my trying to make the URL absolute. My test URL was
> http://www.hallikainen.com/FccRules/2009/36/382/../../../2008/36/382/index.php

Yes, this is perfectly normal. So are

http://www.hallikainen.com/FccRules/2009/36/382/..//../../2008/36/382/index.php
http://www.hallikainen.com/FccRules/2009/36/382/.././../../2008/36/382/index.php

Gerhard

2009\05\28@140119 by Tamas Rudnai

On Thu, May 28, 2009 at 6:39 PM, Gerhard Fiedler wrote:

> Dave seemed to imply that a relative URL may include a scheme or a
> server (the "user:password@host:port" part he mentioned). I'm not sure
> this is correct. If it isn't, you're still back to the simplicity of my
> post a few posts back: if it starts with a scheme ("http:" in your case)
> you consider it absolute, if not, consider it relative.
>

I am not sure either, however, this was a bit weird to see with FF -- I
passed this URL to the file:// protocol and a directory list appeared...

file://user:password@home

Note that my root directory was listed instead of the home... So I guess in
this case it treats the 'home' as a host name instead of a directory name
and as the protocol file:// does not know anything about passwords and host
names it just cuts it off and passes the rest:

Index of file://user:password@home/

Show hidden objects
Name     Size     Last Modified
bin         04/05/09     22:23:55
boot         19/05/09     19:58:19
cdrom         04/05/09     21:37:45
dev         28/05/09     18:24:06
etc ............(and so on)

And yes, file://user:password@home/home lists my home dir...

Tamas
--
http://www.mcuhobby.com

2009\05\28@140426 by Gerhard Fiedler

Harold Hallikainen wrote:

> This is what I'm afraid of... The proposed standard just says the URL
> can be absolute or relative. It does not limit how convoluted it may
> be.

FWIW, here's "the" standard :) <http://tools.ietf.org/html/rfc3986>

Section 4 basically says that a URI must start with a scheme, and a
relative (URI) reference mustn't. So if you have "http:" at the
beginning, you have an absolute (http) URI, and if not, it's a relative
reference.

In section 5 they define the resolution. I think you probably can
extract a simple enough subset for your application to make simple
appending, maybe with a very few conditional clauses, viable.
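
One possible subset, sketched in C under several assumptions: the base is always an absolute URL of the form "http://host/...", the reference is not empty and does not start with '?' or '#', and dot segments are left for the server to interpret. The name resolve_ref and its return convention are hypothetical; this is a reading of the cases in section 5.2, not a complete implementation of it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. 'base' must look like "http://host/dir/list.xml".
 * Returns 0 on success, -1 if 'out' is too small or 'base' has no "//". */
int resolve_ref(const char *base, const char *ref, char *out, size_t outsz)
{
    const char *auth = strstr(base, "//");
    const char *path, *p;
    size_t n, keep, slash = 0, rlen = strlen(ref);

    if (auth == NULL)
        return -1;
    /* the base path starts at the first '/', '?' or '#' after the host */
    path = auth + 2 + strcspn(auth + 2, "/?#");

    n = strcspn(ref, ":/?#");
    if (n > 0 && ref[n] == ':') {
        keep = 0;                          /* has a scheme: use as-is */
    } else if (ref[0] == '/' && ref[1] == '/') {
        keep = (size_t)(auth - base);      /* "//host/...": keep "http:" */
    } else if (ref[0] == '/') {
        keep = (size_t)(path - base);      /* "/path": keep scheme and host */
    } else {
        /* keep the base up to and including the last '/' of its path */
        p = path + strcspn(path, "?#");
        while (p > path && p[-1] != '/')
            p--;
        keep = (size_t)(p - base);
        slash = (p == path) ? 1 : 0;       /* base had no path: add '/' */
    }

    if (keep + slash + rlen + 1 > outsz)
        return -1;
    memcpy(out, base, keep);
    if (slash)
        out[keep++] = '/';
    memcpy(out + keep, ref, rlen + 1);
    return 0;
}
```

The design is "decide how much of the base to keep, then append", which keeps the code to a few conditionals and a couple of memcpy calls.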

Gerhard

2009\05\28@142910 by Dave Tweed

Gerhard Fiedler wrote:
> Dave seemed to imply that a relative URL may include a scheme or a
> server (the "user:password@host:port" part he mentioned). I'm not sure
> this is correct.

Sorry, I was trying to be succinct, but lost some clarity in the process.

"http:path/item.html" is a perfectly valid partial URL that overrides the
scheme, but is still relative to the current document's path (or the base
path, if given), not absolute.

"http:/path/item.html" is an absolute path, but relative to the default
server's root.

"http://www.server.com/path/item.html" is another absolute path, to a
particular server. Note that it is not possible to specify a server without
using an absolute path on that server.

The same thing applies even if the scheme is not explicitly given in the
partial URL:

"path/item.html" - relative

"/path/item.html" - absolute on default server

"//www.server.com/path/item.html" - absolute on explicit server

> However, if your base URL contains a path part in addition to the server
> (like "http://myserver/basepath/"), things become a bit ambiguous. What
> means a "relative" URL of the type "/path1/path2/file.ext", in the
> context of this base URL? Would this expand to
> <http://myserver/basepath/path1/path2/file.ext> (accepting the base URL
> as base for all paths that don't specify a server)

No.

> or to <http://myserver/path1/path2/file.ext> (accepting the idea that a
> starting slash means the top level directory that is accessible on the
> specified server)?

Yes.

> > In a relative URL, we should never see //.
>
> Not at the beginning, but in the middle of the path part it's perfectly
> legal.

Yes.

> > If the URL starts with /, we're starting at the base or root of the
> > server directory tree.
>
> Actually, if it starts with /, you're referencing the local file system.

No, unless the scheme is switched to "file" instead of "http".

Note, your browser may be making that switch for you in the interest of
convenience, but that's not part of the specification.
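
The classification described in this post, where an optional scheme is skipped and the leading slashes decide, can be sketched in a few lines of C. The enum and function names are invented for this example. Note also that, as far as I can tell, strict RFC 3986 resolution treats any reference carrying a scheme as absolute, so the scheme-skipping step reflects the reading given in this post rather than the strict algorithm.

```c
#include <stddef.h>
#include <string.h>

typedef enum { REF_RELATIVE, REF_ABS_PATH, REF_NET_PATH } ref_kind;

/* Classifies a partial URL by what follows its optional "scheme:" prefix:
 * "//" means an explicit server, "/" an absolute path on the default
 * server, anything else a path relative to the current document. */
ref_kind classify_ref(const char *ref)
{
    size_t n = strcspn(ref, ":/?#");

    if (n > 0 && ref[n] == ':')
        ref += n + 1;               /* step over the optional "scheme:" */
    if (ref[0] == '/' && ref[1] == '/')
        return REF_NET_PATH;
    if (ref[0] == '/')
        return REF_ABS_PATH;
    return REF_RELATIVE;
}
```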

-- Dave Tweed
