File Formats: Text vs Binary

Introduction: At the End of the Day, It’s All Just Bytes

On disk, every file your system stores —source code, MP3s, PDFs, database blobs— is just a long sequence of bytes. The “text vs binary” distinction is really about how software interprets those bytes and what assumptions tools make when they open them.

What We Call a “Text File”

A text file is a file whose bytes are meant to represent characters in some character encoding such as ASCII or UTF‑8. In practice, that means the bytes mostly map to printable characters plus a small set of control characters like newlines.

When you open a .txt, .md, .csv, .json, or .yaml file, your editor decodes bytes into characters using an encoding it assumes or that you configured. If the bytes are valid for that encoding, you see readable text; if not, you see replacement symbols or garbage, most likely mojibake.

What We Call a “Binary File”

A binary file is any file whose bytes are not primarily meant to be interpreted as text characters, but instead follow some other structured format. Those bytes might represent pixels, audio samples, instructions, compressed data, offsets, and so on, according to a format specification.

Typical binary formats include images (PNG, JPEG), audio (MP3, WAV), video (MP4), executables, and many document formats such as PDF and some Office files. These often contain small textual chunks (like metadata strings) but their overall structure is not “a stream of characters”.

Common Ground: How Text and Binary Files Are Similar

Both text and binary files are just finite-length sequences of bytes ranging from 0 to 255; the operating systems (OS) don’t inherently distinguish the two. The meaning comes from the conventions and format specs that applications apply when reading those bytes.

Both categories exist because they solve different problems: representing human-readable information versus efficiently encoding structured or multimedia data. Either can embed the other —for example, base64-encoded binary inside JSON, or textual metadata chunks inside PNGs.

Differences That Matter in Practice

In a text file, tools assume an encoding and treat certain bytes specially as control characters, such as newline and sometimes tab. Many text-processing tools will choke on bytes that violate their assumptions, such as embedded null bytes.

In a binary file, there are no such text-specific constraints; any byte value can appear anywhere, and the application must understand the format to make sense of it. This is why opening an image or executable in a text editor usually produces garbled output instead of meaningful content.

Text File Formats You Use Every Day

Common text formats include:

Plain text (.txt) and documentation formats like Markdown (.md).
Data interchange formats such as CSV, JSON, and YAML.
Source code files in languages like Python, JavaScript, C, HTML, SQL, and configuration files for infrastructure and applications.

Engineers favor text formats for configs, logs, and APIs because they’re easy to inspect, diff in version control, review in pull requests, and manipulate with tools like grep, awk, and sed.

Binary File Formats in the Wild

Binary formats cover a broad range of everyday files:

Images: PNG, JPEG/JPG, GIF, WebP.
Audio and video: MP3, AAC, FLAC, MP4, MKV.
Documents and archives: PDF, many Office formats, ZIP, tarballs with compression.
Executables and object code: platform-specific binary formats and shared libraries.

These formats pack data efficiently and often include rich structure (headers, indexes, streams, metadata), which makes them suitable for large or complex content.

How Files Live on Disk: A Byte-Level Glimpse

At the filesystem level, a file is just a collection of bytes plus metadata like name, size, timestamps, and permissions. The filesystem does not inherently know whether a file is “text” or “binary”; that classification belongs to the software reading it.

Many binary formats start with a magic number or signature —a fixed pattern of bytes that helps tools recognize the format, such as PNG’s 89 50 4E 47 or PDF’s %PDF marker in bytes. File extensions are conventional hints; robust tools often inspect the contents instead.

Tools of the Trade: Text Editors vs Hex Editors

Text editors (Visual Studio Code, Vim, Emacs, Sublime Text, Notepad++, etc.) are optimized for editing character-based content. They assume some encoding, render characters, treat line endings specially, and provide features like syntax highlighting, search, and code-aware navigation.

Hex editors or binary editors (Hex Fiend, 010 Editor, HxD, or xxd combined with vim) show files as raw bytes, typically in hexadecimal alongside an ASCII/Unicode view. They expose byte offsets, numeric interpretations, and sometimes format-aware templates, making them ideal for inspecting magic numbers, understanding binary layouts, and doing low-level edits.

Advantages of Text Formats (and Their Hidden Costs)

Text formats shine when humans need to read, review, and modify data. They work well with diff and merge tools, age gracefully as tooling evolves, and are easy to inspect when debugging production issues over SSH.

The trade-offs: representing structured or numeric data as text can be larger on disk and slower to parse, and encodings plus line-ending differences can introduce subtle bugs, especially across platforms.

Advantages of Binary Formats (and Their Trade-offs)

Binary formats can represent complex or numeric data more compactly and closer to the in-memory representation, often improving I/O and parsing performance. They are essential for multimedia, high-volume data, and low-level system components where raw throughput matters.

However, they are opaque without the right tooling or documentation. Debugging issues requires specialized viewers or hex editors, and small corruption can make a file unreadable if checksums or structures no longer line up.

How to Choose: Text vs Binary in Your Own Designs

For configuration, infrastructure definitions, logs, and many external APIs, text formats tend to win because readability and evolvability are more important than raw performance. For large internal datasets, performance-critical serialization, or media encoding, binary formats are usually a better fit.

A common pattern is to use a text-based envelope (like JSON over HTTP) that references or carries binary payloads (files, blobs) when necessary, keeping the control plane human-friendly while the data plane stays efficient.

Closing Thoughts: Think in Bytes, Design for Humans

Aspect	Text formats	Binary formats
What they are	Bytes interpreted as characters in a text encoding, usually UTF-8	Bytes interpreted as structured, non-text data
Core similarity	Both are just sequences of bytes on disk	Both follow conventions/specs that software must understand
Typical examples	`.txt`, `.md`, `.csv`, `.json`, `.yaml`, source code	PNG, JPEG, MP3, MP4, PDF, ZIP, executables, database files
Primary use cases	Configs, logs, docs, data interchange, code	Multimedia, large datasets, executables, compact storage
How tools read them	Assume an encoding, parse characters and newlines	Parse headers, fields, offsets, possibly compressed sections
Common editors	VS Code, Vim, Emacs, Sublime, Notepad++	Hex Fiend, 010 Editor, HxD, `xxd` + vim
Human readability	Usually human-readable (if encoding matches)	Usually opaque; may contain small readable text fragments
Debuggability	Easy to inspect, diff, and review	Requires format docs or hex/binary tools to understand
Storage efficiency	Often larger and slower to parse for complex data	Often more compact and faster for structured/numeric data
Portability quirks	Encoding and line-ending issues across platforms	Endianness, alignment, versioned formats, tool compatibility
Main advantages	Readable, diffable, tool-friendly, long-term friendly	Compact, efficient, rich structure, better for multimedia
Main disadvantages	Can be verbose; encoding bugs and parsing overhead	Harder to inspect and debug; fragile without proper tooling

Everything eventually reduces to bytes, but how you choose to encode and expose those bytes determines how maintainable, debuggable, and performant your systems feel over time. Learning to view the same file both in a text editor and in a hex editor is a great way to build intuition about what those bytes really represent.