Remove Consecutive Duplicate Lines With awk

I ran into an interesting problem yesterday. At some point, while scripting updates to a collection of repos, I managed to duplicate a few lines in several files. I ended up with something like this:

README.md
# `dotfiles-role-javascript`
# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)

...

Finally, these variables must be set:

```yml
author_name
author_email
author_url
```

## Dependencies

```yml
---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
```

I figured there had to be an easy solution using awk, so I grabbed the first SO thread I saw and ran with it.

$ awk '!seen[$0]++' README.md

This one's particularly opaque. Before using it, let's see how it works.

  • seen[$0] references an entry in the seen associative array whose key is the current line, $0. seen isn't a special name; it's just a convention. qqq[$0] achieves the same result.
  • x++ post-increments the value. The expression yields the current value, and the stored value increases immediately afterward.
  • !x negates the expression that follows it. Here that is the result of the post-increment, because ++ binds more tightly than !.

A pattern with no action makes awk print the line whenever the pattern is true. The first time awk sees a line, seen[$0] is uninitialized, which counts as 0 in a numeric context. The post-increment yields that 0 and then bumps the stored count to 1. Negating 0 gives 1 (true), so the line is printed. On every later sighting the old count is nonzero, the negation is 0 (false), and the line is skipped.

!seen[$0]++
!(seen[$0]++)
!(0)
1
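
To watch the counter at work, here is a small sketch on made-up input. The three sample lines and the shell variable names are placeholders of mine, not part of the README above. It prints the count as it was before the increment next to the negated result, so the order of operations is visible.

```shell
# Hypothetical three-line input: "a" appears twice, "b" once.
input=$(printf 'a\nb\na\n')

# Same filter as above: only the first sighting of each line survives.
deduped=$(printf '%s\n' "$input" | awk '!seen[$0]++')

# For every line, show the old count and the result of negating it.
trace=$(printf '%s\n' "$input" | awk '{ old = seen[$0]++; print $0, old + 0, (!old) }')

printf '%s\n' "$deduped"
printf '%s\n' "$trace"
```

The trace reads `a 0 1`, `b 0 1`, `a 1 0`: the second `a` finds a count of 1, the negation is 0, and the line is dropped.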

As clever as it is, it's got some major flaws, especially for my use case:

$ awk '!seen[$0]++' README.md > tmp.md
tmp.md
# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)
...
Finally, these variables must be set:
```yml
author_name
author_email
author_url
```
## Dependencies
---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git

It removed the duplicate lines, but also every empty line after the first and some repeated lines the file needs (see how the second fenced block lost its fences). The whitespace is pretty easy to get back: an empty line has no fields, so NF (the number of fields) is 0 and !NF is true. We can have awk print a line when NF is 0 or the line hasn't been seen yet. In other words,

$ awk '!NF || !seen[$0]++' README.md > tmp2.md
tmp2.md
# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)

...

Finally, these variables must be set:

```yml
author_name
author_email
author_url
```

## Dependencies

---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
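
A quick way to check the NF behavior is to print it next to a few lines. This is a sketch on made-up input; the sample text and variable names are placeholders. With the default field separator, a line holding only spaces also has zero fields, so `!NF` treats it the same as an empty line.

```shell
# Hypothetical input: text, an empty line, a whitespace-only line, repeated text.
sample=$(printf 'x\n\n  \nx\n')

# Print NF for each record; empty and whitespace-only lines report 0.
fields=$(printf '%s\n' "$sample" | awk '{ print NF }')

# Blank lines always pass; other lines pass only on first sighting.
kept=$(printf '%s\n' "$sample" | awk '!NF || !seen[$0]++')

printf '%s\n' "$fields"
printf '%s\n' "$kept"
```

The field counts come out as 1, 0, 0, 1, and the second `x` is the only line that gets dropped.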

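The blank lines are back, but the second fenced block is still missing its fences. As it turns out, I need a few lines to appear more than once, so the common seen-array solutions don't work very well. My problem is really centered around comparing each line against the line immediately before it.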

$ awk 'BEGIN{ old = "" } { new = $0 } old == new && old != "" { next } { old = $0; print }' README.md > tmp3.md
  • BEGIN{ old = "" } seeds old once, before any input is read, rather than on each line (awk treats an unset variable as empty anyway, so this is only for clarity)
  • { new = $0 } runs on every line, updating the value of new
  • old == new && old != "" will be true only if the two lines are equal and nonempty
  • { next } runs if the conditional is true, skipping immediately to the next record (i.e. not printing the second, duplicated line)
  • { old = $0; print } updates the value of old and passes the line on to stdout
tmp3.md
# `dotfiles-role-javascript`

[![Build Status](https://travis-ci.org/thecjharries/dotfiles-role-javascript.svg?branch=master)](https://travis-ci.org/thecjharries/dotfiles-role-javascript)
[![GitHub tag](https://img.shields.io/github/tag/thecjharries/dotfiles-role-javascript.svg)](https://github.com/thecjharries/dotfiles-role-javascript)

...

Finally, these variables must be set:

```yml
author_name
author_email
author_url
```

## Dependencies

```yml
---
- src: git+https://github.com/thecjharries/dotfiles-role-common-software.git
- src: git+https://github.com/thecjharries/dotfiles-role-package-installer.git
```
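
If the separate `new` variable feels redundant, the same neighbor comparison can be sketched with a single remembered value. This is one possible condensation under my own assumptions, not necessarily a better way; `prev` and the sample input are made-up names for illustration.

```shell
# Hypothetical input: a doubled line, two blank lines, and a fence that legitimately repeats.
doc=$(printf 'title\ntitle\n\n\nfence\nbody\nfence\n')

# Print a line when it is empty or differs from the line before it;
# the second rule then remembers the current line for the next comparison.
result=$(printf '%s\n' "$doc" | awk '$0 == "" || $0 != prev; { prev = $0 }')

printf '%s\n' "$result"
```

Only the doubled `title` is collapsed; both blank lines and both `fence` lines survive. Plain `uniq` also drops adjacent repeats, but it would collapse runs of empty lines as well, which the `old != ""` check avoids.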

I'm still pretty new to awk scripting, so there might be a better way to do this. If there is, I'd love to know about it!

CJ Harries

I did a thing once. Change "blog." to "cj@" and you've got my email. All these opinions are mine and might not be shared by clients or employers.
