PICList Thread
'Number base specifications (was: Code error)'
1997\02\20@053735 by Clyde Smith-Stubbs

Thus spake Andy Kunz (spam_OUTmontanaTakeThisOuTspamFAST.NET):

> Oh wait, the Rockwell assembler (Apple ][) did use "$" I think.
>
> $ would then be context sensitive, because it's the standard to denote
> "current PC" as in
>         goto    $-1
>
> for an infinite loop.  I never liked making compilers with context-sensitive
> tokens myself.

It's not context-sensitive as the term is normally used. If you regard
that as context-sensitive, then so is 0 - on its own it's a number - when
it comes after a letter e.g. M0 then it's part of a symbol. Context
sensitivity refers to stuff like the IntelSoft 8086 assembler syntax,
where the instruction

       mov     ax,fred

could mean either direct addressing, or immediate addressing, i.e. either
move the contents of the location 'fred', or move the address of the
location 'fred'. The distinction depends on the previous type of declaration
of the symbol 'fred'. This, in an assembler, is a Bad Thing, in my view. But
the notion of context-sensitivity is not usually applied at the lexical
level, where individual characters are grouped into tokens.

There is, regrettably, a veritable plethora of different syntaxes (should that
be syntaces, I wonder?) for number-base specification. As Andy suggested, the
only sensible solution is to support all of them.

Clyde


--
Clyde Smith-Stubbs    | HI-TECH Software,       | Voice: +61 7 3354 2411
.....clydeKILLspamspam@spam@htsoft.com      | P.O. Box 103, Alderley, | Fax:   +61 7 3354 2422
http://www.htsoft.com | QLD, 4051, AUSTRALIA.   |
---------------------------------------------------------------------------
Download a FREE beta version of our new ANSI C compiler for the PIC
microcontroller! Point your WWW browser at http://www.htsoft.com/

1997\02\20@055225 by Andy Kunz

At 08:35 PM 2/20/97 +1000, you wrote:
>Thus spake Andy Kunz (montanaspamKILLspamFAST.NET):
>
>> Oh wait, the Rockwell assembler (Apple ][) did use "$" I think.
>>
>> $ would then be context sensitive, because it's the standard to denote
>> "current PC" as in
>>         goto    $-1
>>
>> for an infinite loop.  I never liked making compilers with context-sensitive
>> tokens myself.
>
>It's not context-sensitive as the term is normally used. If you regard
>that as context-sensitive, then so is 0 - on its own it's a number - when
>it comes after a letter e.g. M0 then it's part of a symbol. Context
>sensitivity refers to stuff like the IntelSoft 8086 assembler syntax,
>where the instruction

I was approaching "context sensitive" from the point of the
compiler/assembler writer.

In compiler design, one frequently is able to consider alpha characters as
starting identifiers, math operators as such, etc.  Numerics are "alpha" if
they are immediately following an alpha.  This is usually very easily
handled by the tokenizer in the first pass of the compiler.

It's when the tokenizer can't figure out what a character is as soon as it
sees it that things get touchy.  For example, if we were to support "$" as
both denoting a hex value and the PC, we have to "look ahead" to see the
next character before we know what to do.

       movlw   $0c
       goto    $-1

If the language you are writing your compiler in supports look-ahead
functionality (some do not!) for file I/O, it isn't a problem.  You can GET
and UNGET.
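
As a rough illustration of the GET/UNGET approach described above, here is a
minimal C sketch. The names lex_dollar, TOK_HEX, TOK_PC and TOK_NONE are
hypothetical, and the rule "a '$' followed by a hex digit is a hex literal,
otherwise it is the current PC" is an assumption made for the example, not the
behaviour of any particular assembler.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical token kinds for this sketch. */
enum dollar_kind { TOK_HEX, TOK_PC, TOK_NONE };

/*
 * Called when the next character on `in` may be '$'.
 * Reads the '$', looks one character ahead, and pushes back whatever
 * does not belong to the token.  On TOK_HEX, *value holds the number.
 */
enum dollar_kind lex_dollar(FILE *in, unsigned long *value)
{
    int c = getc(in);
    if (c != '$') {
        if (c != EOF)
            ungetc(c, in);              /* not a '$' token at all */
        return TOK_NONE;
    }

    c = getc(in);                       /* one character of look-ahead */
    if (c == EOF || !isxdigit(c)) {
        if (c != EOF)
            ungetc(c, in);              /* e.g. the '-' in "$-1" */
        return TOK_PC;                  /* bare '$' means "current PC" */
    }

    *value = 0;
    while (c != EOF && isxdigit(c)) {
        int digit = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
        *value = *value * 16 + (unsigned long)digit;
        c = getc(in);
    }
    if (c != EOF)
        ungetc(c, in);                  /* first character after the literal */
    return TOK_HEX;
}
```

With input "$0c" this returns TOK_HEX and stores 12; with "$-1" it returns
TOK_PC and leaves the "-1" unread for the expression parser.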

What's funny is that when reading post notation on types, you need to read
the value as a string and then parse it when you reach the type.  That's a
pain, too.  That's one thing where C is much better than many other languages.
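
To make the "read it as a string, then parse it once you reach the type" point
concrete, here is a small C sketch. It assumes a suffix convention of h for
hex, o or q for octal, b for binary and no suffix for decimal; the function
name parse_suffixed is hypothetical, and real assemblers differ in the
suffixes they accept.

```c
#include <ctype.h>
#include <string.h>

/*
 * Parse a number whose base is given by a trailing letter (assumed
 * convention: h = hex, o or q = octal, b = binary, none = decimal).
 * Returns 1 and stores the result in *out, or 0 if the text is malformed.
 */
int parse_suffixed(const char *text, unsigned long *out)
{
    size_t len = strlen(text);
    unsigned base = 10;

    if (len == 0)
        return 0;

    /* The base is only known once the last character has been seen. */
    switch (tolower((unsigned char)text[len - 1])) {
    case 'h': base = 16; len--; break;
    case 'o':
    case 'q': base = 8;  len--; break;
    case 'b': base = 2;  len--; break;
    default:  break;
    }
    if (len == 0)
        return 0;                       /* a suffix with no digits */

    unsigned long value = 0;
    for (size_t i = 0; i < len; i++) {
        int c = tolower((unsigned char)text[i]);
        unsigned digit;
        if (isdigit(c))
            digit = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f')
            digit = (unsigned)(c - 'a' + 10);
        else
            return 0;                   /* not a digit in any base */
        if (digit >= base)
            return 0;                   /* digit too large for this base */
        value = value * base + digit;
    }
    *out = value;
    return 1;
}
```

Note that the digits cannot be converted as they arrive: "101b" and "101h"
share their first three characters but have different values.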

Your 8086 assembler example is a perfect reason why people should use C.  <G>

Andy
==================================================================
Andy Kunz - Montana Design - 409 S 6th St - Phillipsburg, NJ 08865
         Hardware & Software for Industry & R/C Hobbies
       "Go fast, turn right, and keep the wet side down!"
==================================================================

1997\02\20@061603 by Clyde Smith-Stubbs

Thus spake Andy Kunz (.....montanaKILLspamspam.....FAST.NET):

> It's when the tokenizer can't figure out what a character is as soon as it
> sees it that things get touchy.  For example, if we were to support "$" as
> both denoting a hex value and the PC, we have to "look ahead" to see the
> next character before we know what to do.
>
>         movlw   $0c
>         goto    $-1

Yes, but single-character lookahead is required anyway - otherwise you would
not even be able to parse 1+1. It's best to
implement your own level of get and unget rather than relying on ungetc()
or equivalent. Apart from anything else, it makes life much easier if
you read a whole line of input at one time.
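
A minimal sketch of what a home-grown get/unget over a whole input line might
look like in C. The struct and function names (line_reader, lr_load, lr_get,
lr_unget, lr_peek) and the 256-byte line limit are assumptions made for the
example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical one-line input buffer with its own get/unget. */
struct line_reader {
    char   text[256];   /* current source line, NUL-terminated */
    size_t pos;         /* index of the next unread character */
};

/* Load a line (truncated if too long) and reset the read position. */
void lr_load(struct line_reader *lr, const char *line)
{
    strncpy(lr->text, line, sizeof lr->text - 1);
    lr->text[sizeof lr->text - 1] = '\0';
    lr->pos = 0;
}

/* Return the next character, or '\0' at end of line. */
int lr_get(struct line_reader *lr)
{
    int c = (unsigned char)lr->text[lr->pos];
    if (c != '\0')
        lr->pos++;
    return c;
}

/*
 * Step back one character.  Unlike ungetc(), this can be repeated.
 * Only meaningful after an lr_get() that returned a non-NUL character.
 */
void lr_unget(struct line_reader *lr)
{
    if (lr->pos > 0)
        lr->pos--;
}

/* Look at the next character without consuming it. */
int lr_peek(const struct line_reader *lr)
{
    return (unsigned char)lr->text[lr->pos];
}
```

Because the whole line is in memory, backing up is just a decrement of the
index, and error messages can quote the offending line directly.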

Lexical analysis can be tricky at times; consider the following
fragment of C code:

char    a[] = {
       0x0E+1,
       0x0D+2,
};

One of these is a legal expression in ANSI C, the other is not.

Cheers, Clyde


1997\02\20@211647 by John Payson

> Lexical analysis can be tricky at times; consider the following
> fragment of C code:
>
> char    a[] = {
>         0x0E+1,
>         0x0D+2,
> };
>
> One of these is a legal expression in ANSI C, the other is not.

What is the problem here?  The only thing I can see which is questionable is
the extra comma after the second expression; I believe, however, that the ANSI
standard requires that compilers accept (and ignore) a comma following the
last item in an initializer.  Is there something else which I should be
noticing?

1997\02\21@015013 by Clyde Smith-Stubbs

Thus spake John Payson (supercatspamspam_OUTMCS.NET):

> > Lexical analysis can be tricky at times; consider the following
> > fragment of C code:
> >
> > char    a[] = {
> >         0x0E+1,
> >         0x0D+2,
> > };
> >
> > One of these is a legal expression in ANSI C, the other is not.
>
> What is the problem here?  The only thing I can see which is questionable is
> the extra comma after the second expression; I believe, however, that the ANSI
> standard requires that compilers accept (and ignore) a comma following the
> last item in an initializer.  Is there something else which I should be
> noticing?

Yes, there is (but it requires a careful reading of the ANSI standard
to figure it out).

Now first up, since sometimes I seem to be a little too obtuse in my postings,
let me make it clear that I don't like what the ANSI standard says about
this, i.e. I think the standard is broken in this regard. So there's no point
in arguing with me about it - I just happen to have been caught by a compiler
(gcc) that is very literal in its interpretation of the standard. But it
IS the standard so it's neither wrong nor right, it just is.

The ANSI standard for C separates the lexical analysis and parsing of a C
program in several stages; the first happens at the pre-processor level,
and it is at this stage that all tokenization occurs, i.e. once a program
has been converted to a stream of tokens, those tokens are indivisible by
later stages. The tokenization is specified by a grammar, and one of the
grammar elements is a non-terminal called a 'pp-number', which is meant
to represent anything in C that will later be interpreted as a number.

Incidentally, it is not required by the standard that a compiler actually
implement things in these stages, but it must behave as if it does (this
is known as the "as-if" rule).

Basically a pp-number is anything that starts with a digit (or a decimal
point and a digit), followed by any sequence of digits, letters,
underscores and decimal points, in which a plus or minus sign is also
accepted, but only immediately after an 'e' or 'E'. I may have the exact
details wrong, but that's the gist of it. This caters for decimal numbers
(123), hex numbers (0x123) and real numbers (0.123, 123e+10).

But the way the grammar is written, 0x0E+1 is a pp-number! So
at the lexical level, the sequence 0x0E+1 becomes a single token,
whereas 0x0D+2 is split into a pp-number (0x0D), a plus sign, and another
pp-number (2). Now when you get to the next stage, where numbers are
actually parsed into hex, decimal or whatever, this token (0x0E+1)
is not a legal number. So your program will not compile.

Now being a subjective, illogical human being, I can look at 0x0E+1
and know perfectly well what the programmer meant. But a strictly-
conforming ANSI compiler is not allowed to be that clever.

So if you want to add a constant to 0x0E, leave a space before the +
sign, i.e.

       0x0E +1

is fine.
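
The tokenization described above can be reproduced with a few lines of C. This
is only a sketch of the pp-number rule as given in the C90 grammar (a digit,
or a period and a digit, followed by digits, letters, underscores, periods,
or an e/E with a sign); the function name pp_number_length is hypothetical,
and later revisions of the standard extend the rule.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length of the pp-number at the start of s, or 0 if s does
 * not begin with one.  A sign is absorbed only directly after 'e' or 'E'.
 */
size_t pp_number_length(const char *s)
{
    size_t i;

    if (isdigit((unsigned char)s[0]))
        i = 1;
    else if (s[0] == '.' && isdigit((unsigned char)s[1]))
        i = 2;
    else
        return 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        if ((c == 'e' || c == 'E') && (s[i + 1] == '+' || s[i + 1] == '-'))
            i += 2;     /* exponent-style sign becomes part of the token */
        else if (isalnum(c) || c == '_' || c == '.')
            i += 1;
        else
            break;
    }
    return i;
}
```

Applied to "0x0E+1," this consumes all six characters of 0x0E+1, whereas on
"0x0D+2," it stops after the four characters of 0x0D, and on "0x0E +1" the
space ends the token after 0x0E.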

