When using gawk, the value of RS is not limited to a
one-character string. If it contains more than one character, it is
treated as a regular expression
(see Regular Expressions). (c.e.)
In general, each record
ends at the next string that matches the regular expression; the next
record starts at the end of the matching string. This general rule is
actually at work in the usual case, where RS contains just a
newline: a record ends at the beginning of the next matching string (the
next newline in the input), and the following record starts just after
the end of this string (at the first character of the following line).
The newline, because it matches RS, is not part of either record.
When RS is a single character, RT
contains the same single character. However, when RS is a
regular expression, RT contains
the actual input text that matched the regular expression.
If the input file ends without any text matching RS,
gawk sets RT to the null string.
The following example illustrates both of these features.
It sets RS equal to a regular expression that
matches either a newline or a series of one or more uppercase letters
with optional leading and/or trailing whitespace:
$ echo record 1 AAAA record 2 BBBB record 3 |
> gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
> { print "Record =", $0,"and RT = [" RT "]" }'
-| Record = record 1 and RT = [ AAAA ]
-| Record = record 2 and RT = [ BBBB ]
-| Record = record 3 and RT = [
-| ]
The square brackets delineate the contents of RT, letting you
see the leading and trailing whitespace. The final value of
RT is a newline.
See A Simple Stream Editor for a more useful example
of RS as a regexp and RT.
If you set RS to a regular expression that allows optional
trailing text, such as ‘RS = "abc(XYZ)?"’, it is possible, due
to implementation constraints, that gawk may match the leading
part of the regular expression, but not the trailing part, particularly
if the input text that could match the trailing part is fairly long.
gawk attempts to avoid this problem, but currently, there’s
no guarantee that this will never happen.
Caveats When Using Regular Expressions for RS
Remember that in …
Record splitting with regular expressions works differently than
regexp matching with the …
The use of RS as a regular expression and the RT
variable are gawk extensions; they are not available in
compatibility mode
(see Command-Line Options).
In compatibility mode, only the first character of the value of
RS determines the end of the record.
mawk has allowed RS to be a regexp for decades.
As of October, 2019, BWK awk also supports it. Neither
version supplies RT, however.
RS = "\0" Is Not Portable
There are times when you might want to treat an entire data file as a
single record. The only way to make this happen is to give You might think that for text files, the NUL character, which
consists of a character with all bits equal to zero, is a good
value to use for BEGIN { RS = "\0" } # whole file becomes one record?
Almost all other It happens that recent versions of See Reading a Whole File at Once for an interesting way to read
whole files. If you are using |