Regular Expressions

Specifying WHERE to Look

In this tutorial we examine ways of specifying WHERE in the source string we should look for a match.

Other aspects of pattern matching will be kept constant throughout the tutorial, namely:

  • "WHAT" will be a single character
  • "HOW MANY" will be one repetition of the text
  • "WHICH" occurrence will be the first occurrence.

Regular Expressions versus LIKE

Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.

execute set_target( 'abc'  )
execute add_target( '---abc' )
execute add_target( '---abc---' )

select 'abc' as pattern, target
from   targets
where  target like 'abc';
 
PATTERN         TARGET
--------------- ----------
abc             abc
 
select '%abc%' as pattern, target
from   targets
where  target like '%abc%';
 
PATTERN         TARGET
--------------- ----------
%abc%           abc
                ---abc
                ---abc---
 

Regular expressions take the opposite approach. The pattern 'abc' matches any target containing 'abc' and a special pattern like '^abc$' must be used to match only the target 'abc'.


select 'abc' as pattern, target
from   targets
where  regexp_like( target, 'abc' );
 
PATTERN         TARGET
--------------- ----------
abc             abc
                ---abc
                ---abc---
 

select '^abc$' as pattern, target
from   targets
where  regexp_like( target, '^abc$' );
 
PATTERN         TARGET
--------------- ----------
^abc$           abc
 

Matching the Beginning or End of a Line

As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.

(empty string) a (empty string) b (empty string) c (empty string)
 

A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.

(empty string) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f (empty string)
 

This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.

By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.

Default Mode:

(^)            a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f ($)
 

Multi-line Mode:

(^) a (empty string) b (empty string) c ($) (LF)
(^) d (empty string) e (empty string) f ($)
 

The following examples show how '^' and '$' operate in default mode. In the target data, 'unistr('\000A')' represents a Linefeed character and the target value 'abc(LF)def' appears as 'abc' on one line and 'def' on the next.

execute set_pattern( '^'  )
execute add_pattern( '$'  )
execute add_pattern( '^.' )
execute add_pattern( '.$' )
execute add_pattern( 'a' )
execute add_pattern( '^a' )
execute add_pattern( '^d' )
execute add_pattern( 'c$' )
execute add_pattern( 'f' )
execute add_pattern( 'f$' )

execute set_target( 'abcdef'  )
execute add_target( 'abc' || unistr('\000A') || 'def' )

set pagesize 60
set recsep   each

break on pattern

select
  pattern ,
  target ,
  regexp_substr( target, pattern ) as matched_value ,
  regexp_instr( target, pattern ) as position
from
  patterns ,
  targets
order by
  sort_key ,
  target desc
;
 
PATTERN         TARGET     MATCHED_VALUE   POSITION
--------------- ---------- --------------- --------
^               abcdef     (null)                 1

                abc        (null)                 1
                def

$               abcdef     (null)                 7

                abc        (null)                 8
                def

^.              abcdef     a                      1

                abc        a                      1
                def

.$              abcdef     f                      6

                abc        f                      7
                def

a               abcdef     a                      1

                abc        a                      1
                def

^a              abcdef     a                      1

                abc        a                      1
                def

^d              abcdef     (null)                 0

                abc        (null)                 0
                def

c$              abcdef     (null)                 0

                abc        (null)                 0
                def

f               abcdef     f                      6

                abc        f                      7
                def

f$              abcdef     f                      6

                abc        f                      7
                def
 

Using the same patterns in multi-line mode gives us these results.

select
  pattern ,
  target ,
  regexp_substr( target, pattern, 1, 1, 'm' ) as matched_value ,
  regexp_instr( target, pattern, 1, 1, 0, 'm' ) as position
from
  patterns ,
  targets
order by
  sort_key ,
  target desc
;
 
PATTERN         TARGET     MATCHED_VALUE   POSITION
--------------- ---------- --------------- --------
^               abcdef     (null)                 1

                abc        (null)                 1
                def

$               abcdef     (null)                 7

                abc        (null)                 4
                def

^.              abcdef     a                      1

                abc        a                      1
                def

.$              abcdef     f                      6

                abc        c                      3
                def

a               abcdef     a                      1

                abc        a                      1
                def

^a              abcdef     a                      1

                abc        a                      1
                def

^d              abcdef     (null)                 0

                abc        d                      5
                def

c$              abcdef     (null)                 0

                abc        c                      3
                def

f               abcdef     f                      6

                abc        f                      7
                def

f$              abcdef     f                      6

                abc        f                      7
                def
 
set recsep wrapped
break on pattern skip 1
 

For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.

Starting the Search From Somewhere Other Than Position One

By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.

Starting from position 1 (no position parameter is used):

execute set_target( 'abc1--abc2'  )

select
  target ,
  'abc.' as pattern ,
  regexp_substr( target, 'abc.' ) as matched_value ,
  regexp_instr( target, 'abc.' ) as position
from
  targets
;
 
TARGET     PATTERN         MATCHED_VALUE   POSITION
---------- --------------- --------------- --------
abc1--abc2 abc.            abc1                   1
 

Starting from position 5 (position parameter is set to "5"):

select
  target ,
  'abc.' as pattern ,
  regexp_substr( target, 'abc.', 5 ) as matched_value ,
  regexp_instr( target, 'abc.', 5 ) as position
from
  targets
;
 
TARGET     PATTERN         MATCHED_VALUE   POSITION
---------- --------------- --------------- --------
abc1--abc2 abc.            abc2                   7
 

Note that, regardless of which location you start the search at, REGEXP_INSTR always returns a position relative to the start of the entire target.

Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.

execute set_target( 'abc'  )
execute add_target( 'abc---def'  )
execute add_target( 'def---abc'  )

select target
from   targets
where  regexp_like( target, '^.{4}.*abc' ) ;
 
TARGET
----------
def---abc
 
select target
from   targets
where  regexp_like( substr( target, 5 ), 'abc' ) ;
 
TARGET
----------
def---abc
 



Linking to SQL Snippets ™

To link to this page in Oracle Technology Network Forums or OraFAQ Forums cut and paste this code.

  • [url=http://www.sqlsnippets.com/en/topic-10217.html]SQL Snippets: Regular Expressions - Specifying WHERE to Look[/url]

To link to this page in HTML documents or Blogger comments cut and paste this code.

  • <a href="http://www.sqlsnippets.com/en/topic-10217.html">SQL Snippets: Regular Expressions - Specifying WHERE to Look</a>

To link to this page in other web sites use the following values.

  • Link Text : SQL Snippets: Regular Expressions - Specifying WHERE to Look
  • URL (href): http://www.sqlsnippets.com/en/topic-10217.html