Regular Expressions: Specifying WHERE to Look

In this tutorial we examine ways of specifying WHERE in the source string we should look for a match.

Other aspects of pattern matching will be kept constant throughout the tutorial, namely:

"WHAT" will be a single character
"HOW MANY" will be one repetition of the text
"WHICH" occurrence will be the first occurrence.

Regular Expressions versus LIKE

Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.

execute set_target( 'abc'  )
execute add_target( '---abc' )
execute add_target( '---abc---' )

select 'abc' as pattern, target
from   targets
where  target like 'abc';

PATTERN         TARGET
--------------- ----------
abc             abc

select '%abc%' as pattern, target
from   targets
where  target like '%abc%';

PATTERN         TARGET
--------------- ----------
%abc%           abc
                ---abc
                ---abc---

Regular expressions take the opposite approach. The pattern 'abc' matches any target containing 'abc' and a special pattern like '^abc$' must be used to match only the target 'abc'.


select 'abc' as pattern, target
from   targets
where  regexp_like( target, 'abc' );

PATTERN         TARGET
--------------- ----------
abc             abc
                ---abc
                ---abc---


select '^abc$' as pattern, target
from   targets
where  regexp_like( target, '^abc$' );

PATTERN         TARGET
--------------- ----------
^abc$           abc

Matching the Beginning or End of a Line

As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.

(empty string) a (empty string) b (empty string) c (empty string)

A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.

(empty string) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f (empty string)

This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.

By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.

Default Mode:

(^)            a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f ($)

Multi-line Mode:

(^) a (empty string) b (empty string) c ($) (LF)
(^) d (empty string) e (empty string) f ($)

The following examples show how '^' and '$' operate in default mode. In the target data, 'unistr('\000A')' represents a Linefeed character and the target value 'abc(LF)def' appears as 'abc' on one line and 'def' on the next.

execute set_pattern( '^'  )
execute add_pattern( '$'  )
execute add_pattern( '^.' )
execute add_pattern( '.$' )
execute add_pattern( 'a' )
execute add_pattern( '^a' )
execute add_pattern( '^d' )
execute add_pattern( 'c$' )
execute add_pattern( 'f' )
execute add_pattern( 'f$' )

execute set_target( 'abcdef'  )
execute add_target( 'abc' || unistr('\000A') || 'def' )

set pagesize 60
set recsep   each

break on pattern

select
  pattern ,
  target ,
  regexp_substr( target, pattern ) as matched_value ,
  regexp_instr( target, pattern ) as position
from
  patterns ,
  targets
order by
  sort_key ,
  target desc
;

PATTERN         TARGET     MATCHED_VALUE   POSITION
--------------- ---------- --------------- --------
^               abcdef     (null)                 1

                abc        (null)                 1
                def

$               abcdef     (null)                 7

                abc        (null)                 8
                def

^.              abcdef     a                      1

                abc        a                      1
                def

.$              abcdef     f                      6

                abc        f                      7
                def

a               abcdef     a                      1

                abc        a                      1
                def

^a              abcdef     a                      1

                abc        a                      1
                def

^d              abcdef     (null)                 0

                abc        (null)                 0
                def

c$              abcdef     (null)                 0

                abc        (null)                 0
                def

f               abcdef     f                      6

                abc        f                      7
                def

f$              abcdef     f                      6

                abc        f                      7
                def

Using the same patterns in multi-line mode gives us these results.

select
  pattern ,
  target ,
  regexp_substr( target, pattern, 1, 1, 'm' ) as matched_value ,
  regexp_instr( target, pattern, 1, 1, 0, 'm' ) as position
from
  patterns ,
  targets
order by
  sort_key ,
  target desc
;

PATTERN         TARGET     MATCHED_VALUE   POSITION
--------------- ---------- --------------- --------
^               abcdef     (null)                 1

                abc        (null)                 1
                def

$               abcdef     (null)                 7

                abc        (null)                 4
                def

^.              abcdef     a                      1

                abc        a                      1
                def

.$              abcdef     f                      6

                abc        c                      3
                def

a               abcdef     a                      1

                abc        a                      1
                def

^a              abcdef     a                      1

                abc        a                      1
                def

^d              abcdef     (null)                 0

                abc        d                      5
                def

c$              abcdef     (null)                 0

                abc        c                      3
                def

f               abcdef     f                      6

                abc        f                      7
                def

f$              abcdef     f                      6

                abc        f                      7
                def

set recsep wrapped
break on pattern skip 1

For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.

Starting the Search From Somewhere Other Than Position One

By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.

Starting from position 1 (no position parameter is used):

execute set_target( 'abc1--abc2'  )

select
  target ,
  'abc.' as pattern ,
  regexp_substr( target, 'abc.' ) as matched_value ,
  regexp_instr( target, 'abc.' ) as position
from
  targets
;

TARGET     PATTERN         MATCHED_VALUE   POSITION
---------- --------------- --------------- --------
abc1--abc2 abc.            abc1                   1

Starting from position 5 (position parameter is set to "5"):

select
  target ,
  'abc.' as pattern ,
  regexp_substr( target, 'abc.', 5 ) as matched_value ,
  regexp_instr( target, 'abc.', 5 ) as position
from
  targets
;

TARGET     PATTERN         MATCHED_VALUE   POSITION
---------- --------------- --------------- --------
abc1--abc2 abc.            abc2                   7

Note that, regardless of which location you start the search at, REGEXP_INSTR always returns a position relative to the start of the entire target.

Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.

execute set_target( 'abc'  )
execute add_target( 'abc---def'  )
execute add_target( 'def---abc'  )

select target
from   targets
where  regexp_like( target, '^.{4}.*abc' ) ;

TARGET
----------
def---abc

select target
from   targets
where  regexp_like( substr( target, 5 ), 'abc' ) ;

TARGET
----------
def---abc

SQL Snippets ™: Tutorials for Oracle Developers