
Using PowerShell to write a file in UTF-8 without the BOM

Out-File seems to force the BOM when using UTF-8:

$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "UTF8" $MyPath

How can I write a file in UTF-8 with no BOM using PowerShell?

Update 2021

PowerShell has changed a bit since I wrote this question 10 years ago. Check multiple answers below, they have a lot of good information!

BOM = Byte-Order Mark. Three bytes placed at the beginning of a file (0xEF, 0xBB, 0xBF) that look like "ï»¿" when displayed as Windows-1252 text.
This is incredibly frustrating. Even third party modules get polluted, like trying to upload a file over SSH? BOM! "Yeah, let's corrupt every single file; that sounds like a good idea." -Microsoft.
The default encoding is UTF8NoBOM starting with PowerShell version 6.0: docs.microsoft.com/en-us/powershell/module/…
Talk about breaking backwards compatibility...
I feel like it should be noted that while a BOM in a UTF-8 file does make a lot of systems choke, it is explicitly valid in the Unicode UTF-8 spec to include one.
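
To check whether a given file actually starts with those three bytes, something along these lines could be used. This is only a sketch: the function name Test-Utf8Bom is made up for illustration and is not a built-in cmdlet.

```powershell
# Hypothetical helper (not part of PowerShell): returns $true if the file
# starts with the UTF-8 BOM bytes 0xEF 0xBB 0xBF.
function Test-Utf8Bom {
  param([Parameter(Mandatory)] [string] $LiteralPath)

  # Resolve to a full path first, because .NET's working directory can differ
  # from PowerShell's current location.
  $fullPath = Convert-Path -LiteralPath $LiteralPath
  $stream = [System.IO.File]::OpenRead($fullPath)
  try {
    $buffer = New-Object byte[] 3
    $read = $stream.Read($buffer, 0, 3)
  } finally {
    $stream.Dispose()
  }
  return ($read -eq 3 -and $buffer[0] -eq 0xEF -and $buffer[1] -eq 0xBB -and $buffer[2] -eq 0xBF)
}
```

Usage: Test-Utf8Bom $MyPath, run before and after rewriting the file.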

XDS

Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:

$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)

Ugh, I hope that's not the only way.
One line [System.IO.File]::WriteAllLines($MyPath, $MyFile) is enough. This WriteAllLines overload writes exactly UTF8 without BOM.
Note that WriteAllLines seems to require $MyPath to be absolute.
@xdhmoore WriteAllLines gets the current directory from [System.Environment]::CurrentDirectory. If you open PowerShell and then change your current directory (using cd or Set-Location), then [System.Environment]::CurrentDirectory will not be changed and the file will end up being in the wrong directory. You can work around this by [System.Environment]::CurrentDirectory = (Get-Location).Path.
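
A way to sidestep the working-directory mismatch without changing [System.Environment]::CurrentDirectory is to resolve the path on the PowerShell side before handing it to .NET. A minimal sketch, assuming example values for $MyPath and $MyFile (both are placeholders here):

```powershell
# Example values (assumptions for illustration only).
$MyPath = 'out.txt'
$MyFile = 'first line', 'second line'

# Resolve the path relative to PowerShell's current location. Unlike
# Convert-Path / Resolve-Path, this also works if the file doesn't exist yet.
$fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath)

# .NET now receives an absolute path, so its own working directory is irrelevant.
[System.IO.File]::WriteAllLines($fullPath, [string[]] $MyFile)
```

If the target file is known to exist already, Convert-Path $MyPath is a shorter alternative.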
Community

The proper way as of now is to use a solution recommended by @Roman Kuzmin in comments to @M. Dudley answer:

[IO.File]::WriteAllLines($filename, $content)

(I've also shortened it a bit by stripping unnecessary System namespace clarification - it will be substituted automatically by default.)


This (for whatever reason) did not remove the BOM for me, whereas the accepted answer did.
@Liam, probably some old version of PowerShell or .NET?
I believe older versions of the .NET WriteAllLines function did write the BOM by default. So it could be a version issue.
Confirmed: this writes with a BOM in PowerShell 3, but without a BOM in PowerShell 4. I had to use M. Dudley's original answer.
So it works on Windows 10 where it's installed by default. :) Also, suggested improvement: [IO.File]::WriteAllLines(($filename | Resolve-Path), $content)
Lenny

I figured this wouldn't be UTF, but I just found a pretty simple solution that seems to work...

Get-Content path/to/file.ext | out-file -encoding ASCII targetFile.ext

For me this results in a utf-8 without bom file regardless of the source format.


This worked for me, except I used -encoding utf8 for my requirement.
Thank you very much. I am working with dump logs of a tool - which had tabs inside it. UTF-8 was not working. ASCII solved the problem. Thanks.
Yes, -Encoding ASCII avoids the BOM problem, but you obviously only get 7-bit ASCII characters. Given that ASCII is a subset of UTF-8, the resulting file is technically also a valid UTF-8 file, but all non-ASCII characters in your input will be converted to literal ? characters.
Warning: Definitely not. This deletes all non-ASCII characters and replaces them with question marks. Don't do this or you will lose data! (Tried with PS 5.1 on Windows 10)
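
The data loss described in these comments can be reproduced in memory with the .NET encoding classes, independent of Out-File. A small sketch; the sample text and variable names are arbitrary:

```powershell
# Build 'café' without a non-ASCII literal, so the script file's own encoding
# doesn't matter.
$text = 'caf' + [char] 0xE9

# ASCII encoding: the non-ASCII character is replaced with a literal '?'.
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($text)
$asciiRoundTrip = [System.Text.Encoding]::ASCII.GetString($asciiBytes)

# BOM-less UTF-8 encoding: the text survives the round trip.
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
$utf8RoundTrip = $utf8NoBom.GetString($utf8NoBom.GetBytes($text))

$asciiRoundTrip -eq 'caf?'   # True: the accented character was lost
$utf8RoundTrip -eq $text     # True: nothing was lost
```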
mklement0

Note: This answer applies to Windows PowerShell; by contrast, in the cross-platform PowerShell Core edition (v6+), UTF-8 without BOM is the default encoding, across all cmdlets.

In other words: If you're using PowerShell [Core] version 6 or higher, you get BOM-less UTF-8 files by default (which you can also explicitly request with -Encoding utf8 / -Encoding utf8NoBOM, whereas you get with-BOM encoding with -Encoding utf8BOM).

If you're running Windows 10 and you're willing to switch to BOM-less UTF-8 encoding system-wide - which can have side effects - even Windows PowerShell can be made to use BOM-less UTF-8 consistently - see this answer.

To complement M. Dudley's own simple and pragmatic answer (and ForNeVeR's more concise reformulation):

For convenience, here's advanced function Out-FileUtf8NoBom, a pipeline-based alternative that mimics Out-File, which means:

you can use it just like Out-File in a pipeline.

input objects that aren't strings are formatted as they would be if you sent them to the console, just like with Out-File.

an additional -UseLF switch allows you to transform Windows-style CRLF newlines to Unix-style LF-only newlines.

Example:

(Get-Content $MyPath) | Out-FileUtf8NoBom $MyPath # Add -UseLF for Unix newlines

Note how (Get-Content $MyPath) is enclosed in (...), which ensures that the entire file is opened, read in full, and closed before sending the result through the pipeline. This is necessary in order to be able to write back to the same file (update it in place).
Generally, though, this technique is not advisable for 2 reasons: (a) the whole file must fit into memory and (b) if the command is interrupted, data will be lost.
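
One way to reduce the risk in point (b) is to write the converted content to a temporary file first and replace the original only after that write has finished. The sketch below is not part of Out-FileUtf8NoBom; the sample file and the ".tmp" naming are assumptions for illustration, and the whole file is still read into memory, so point (a) remains.

```powershell
# Create a sample file to convert (assumption: stands in for your real $MyPath).
$MyPath = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-demo.txt'
'line 1', 'line 2' | Out-File -Encoding UTF8 $MyPath

$fullPath = Convert-Path -LiteralPath $MyPath
$tempPath = "$fullPath.tmp"   # hypothetical temp-file naming scheme

# Write the BOM-less copy next to the original first...
$lines = [string[]] @(Get-Content -LiteralPath $fullPath)
[System.IO.File]::WriteAllLines($tempPath, $lines, (New-Object System.Text.UTF8Encoding $false))

# ...and replace the original only once the copy has been written completely.
Move-Item -LiteralPath $tempPath -Destination $fullPath -Force
```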

A note on memory use:

M. Dudley's own answer requires that the entire file contents be built up in memory first, which can be problematic with large files.

The function below improves on this only slightly: all input objects are still buffered first, but their string representations are then generated and written to the output file one by one.

Source code of function Out-FileUtf8NoBom:

Note: The function is also available as an MIT-licensed Gist, and only it will be maintained going forward.

You can install it directly with the following command (while I can personally assure you that doing so is safe, you should always check the content of a script before directly executing it this way):

# Download and define the function.
irm https://gist.github.com/mklement0/8689b9b5123a9ba11df7214f82a673be/raw/Out-FileUtf8NoBom.ps1 | iex
function Out-FileUtf8NoBom {
<#
.SYNOPSIS
  Outputs to a UTF-8-encoded file *without a BOM* (byte-order mark).
.DESCRIPTION
  Mimics the most important aspects of Out-File:
    * Input objects are sent to Out-String first.
    * -Append allows you to append to an existing file, -NoClobber prevents
      overwriting of an existing file.
    * -Width allows you to specify the line width for the text representations
       of input objects that aren't strings.
  However, it is not a complete implementation of all Out-File parameters:
    * Only a literal output path is supported, and only as a parameter.
    * -Force is not supported.
    * Conversely, an extra -UseLF switch is supported for using LF-only newlines.
  Caveat: *All* pipeline input is buffered before writing output starts,
          but the string representations are generated and written to the target
          file one by one.
.NOTES
  The raison d'être for this advanced function is that Windows PowerShell
  lacks the ability to write UTF-8 files without a BOM: using -Encoding UTF8 
  invariably prepends a BOM.
  Copyright (c) 2017, 2020 Michael Klement <mklement0@gmail.com> (http://same2u.net), 
  released under the [MIT license](https://spdx.org/licenses/MIT#licenseText).
#>

  [CmdletBinding()]
  param(
    [Parameter(Mandatory, Position=0)] [string] $LiteralPath,
    [switch] $Append,
    [switch] $NoClobber,
    [AllowNull()] [int] $Width,
    [switch] $UseLF,
    [Parameter(ValueFromPipeline)] $InputObject
  )

  #requires -version 3

  # Convert the input path to a full one, since .NET's working dir. usually
  # differs from PowerShell's.
  $dir = Split-Path -LiteralPath $LiteralPath
  if ($dir) { $dir = Convert-Path -ErrorAction Stop -LiteralPath $dir } else { $dir = $pwd.ProviderPath}
  $LiteralPath = [IO.Path]::Combine($dir, [IO.Path]::GetFileName($LiteralPath))

  # If -NoClobber was specified, throw an exception if the target file already
  # exists.
  if ($NoClobber -and (Test-Path -LiteralPath $LiteralPath)) {
    Throw [IO.IOException] "The file '$LiteralPath' already exists."
  }

  # Create a StreamWriter object.
  # Note that we take advantage of the fact that the StreamWriter class by default:
  # - uses UTF-8 encoding
  # - without a BOM.
  $sw = New-Object System.IO.StreamWriter $LiteralPath, $Append

  $htOutStringArgs = @{}
  if ($Width) {
    $htOutStringArgs += @{ Width = $Width }
  }

  # Note: By not using begin / process / end blocks, we're effectively running
  #       in the end block, which means that all pipeline input has already
  #       been collected in automatic variable $Input.
  #       We must use this approach, because using | Out-String individually
  #       in each iteration of a process block would format each input object
  #       with an individual header.
  try {
    $Input | Out-String -Stream @htOutStringArgs | % { 
      if ($UseLf) {
        $sw.Write($_ + "`n") 
      }
      else {
        $sw.WriteLine($_) 
      }
    }
  } finally {
    $sw.Dispose()
  }

}

user2864740

Starting with version 6, PowerShell supports the UTF8NoBOM encoding for both Set-Content and Out-File, and even uses it as the default encoding.

So in the above example it should simply be like this:

$MyFile | Out-File -Encoding UTF8NoBOM $MyPath

Nice. FYI check version with $PSVersionTable.PSVersion
Worth noting that in PowerShell [Core] v6+ -Encoding UTF8NoBOM is never required, because it is the default encoding.
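
For scripts that have to run in both Windows PowerShell and PowerShell 6+, the version check mentioned above can be used to pick the approach. A rough sketch; Write-Utf8NoBom is a made-up name, not an existing cmdlet:

```powershell
# Hypothetical helper: writes lines as BOM-less UTF-8 in either edition.
function Write-Utf8NoBom {
  param(
    [Parameter(Mandatory)] [string] $Path,
    [Parameter(Mandatory)] [AllowEmptyString()] [string[]] $Lines
  )

  if ($PSVersionTable.PSVersion.Major -ge 6) {
    # PowerShell 6+: the encoding name exists (and is the default anyway).
    $Lines | Out-File -Encoding utf8NoBOM -FilePath $Path
  } else {
    # Windows PowerShell: fall back to .NET with an explicit BOM-less encoding.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)
    [System.IO.File]::WriteAllLines($fullPath, $Lines, (New-Object System.Text.UTF8Encoding $false))
  }
}
```

Usage: Write-Utf8NoBom -Path $MyPath -Lines (Get-Content $MyPath)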
Lucero

When using Set-Content instead of Out-File, you can specify the encoding Byte, which can be used to write a byte array to a file. This in combination with a custom UTF8 encoding which does not emit the BOM gives the desired result:

# This variable can be reused
$utf8 = New-Object System.Text.UTF8Encoding $false

$MyFile = Get-Content $MyPath -Raw
Set-Content -Value $utf8.GetBytes($MyFile) -Encoding Byte -Path $MyPath

The difference from using [IO.File]::WriteAllLines() or similar is that it should work fine with any type of item and path, not only actual file paths.


Nice - works great with strings (which may be all that is needed and certainly meets the requirements of the question). In case you need to take advantage of the formatting that Out-File, unlike Set-Content, provides, pipe to Out-String first; e.g., $MyFile = Get-ChildItem | Out-String
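
Note that -Encoding Byte exists only in Windows PowerShell; in PowerShell 6+ it was replaced by the -AsByteStream switch. A sketch that covers both editions, using an example path and example text (both are assumptions):

```powershell
# Example target path and content (for illustration only).
$MyPath = Join-Path ([System.IO.Path]::GetTempPath()) 'bytes-demo.txt'
$utf8 = New-Object System.Text.UTF8Encoding $false
$bytes = $utf8.GetBytes("first line`r`nsecond line`r`n")

if ($PSVersionTable.PSVersion.Major -ge 6) {
  # PowerShell 6+: raw bytes are written with -AsByteStream.
  Set-Content -Value $bytes -AsByteStream -Path $MyPath
} else {
  # Windows PowerShell: raw bytes are written with -Encoding Byte.
  Set-Content -Value $bytes -Encoding Byte -Path $MyPath
}
```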
jamhan

This script will convert, to UTF-8 without BOM, all .txt files in DIRECTORY1 and output them to DIRECTORY2

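# .NET resolves relative paths against its own working directory, not
# PowerShell's current location, so resolve both directories up front.
$srcDir = Convert-Path DIRECTORY1
$dstDir = Convert-Path DIRECTORY2
foreach ($i in Get-ChildItem -Name -Path $srcDir -Filter *.txt)
{
    $file_content = @(Get-Content -LiteralPath (Join-Path $srcDir $i))
    [System.IO.File]::WriteAllLines((Join-Path $dstDir $i), [string[]] $file_content)
}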

This one fails without any warning. What version of powershell should I use to run it?
The WriteAllLines solution works great for small files. However, I need a solution for larger files. Every time I try to use this with a larger file I'm getting an OutOfMemory error.
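
For large files, a line-by-line copy with .NET's StreamReader and StreamWriter avoids holding the whole file in memory. The following is only a sketch: Convert-ToUtf8NoBom is a made-up name, the source and destination must be different files, and input without a BOM is assumed to be UTF-8 already (StreamReader's default), which differs from how Get-Content treats BOM-less files in Windows PowerShell.

```powershell
# Hypothetical helper: streams a text file into a BOM-less UTF-8 copy.
function Convert-ToUtf8NoBom {
  param(
    [Parameter(Mandatory)] [string] $SourcePath,
    [Parameter(Mandatory)] [string] $DestinationPath
  )

  # Resolve both paths on the PowerShell side before passing them to .NET.
  $src = Convert-Path -LiteralPath $SourcePath
  $dst = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($DestinationPath)
  $encoding = New-Object System.Text.UTF8Encoding $false

  # StreamReader detects a BOM in the input and otherwise assumes UTF-8.
  $reader = New-Object System.IO.StreamReader -ArgumentList $src
  try {
    $writer = New-Object System.IO.StreamWriter -ArgumentList $dst, $false, $encoding
    try {
      # Only one line is held in memory at a time.
      while ($null -ne ($line = $reader.ReadLine())) {
        $writer.WriteLine($line)
      }
    } finally {
      $writer.Dispose()
    }
  } finally {
    $reader.Dispose()
  }
}
```

Usage: Convert-ToUtf8NoBom -SourcePath big.txt -DestinationPath big-nobom.txt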
Andreas Covidiot

Important: this only works if an extra space or newline at the start of the file is no problem for your use case (e.g. if it is an SQL file, a Java file or a human-readable text file).

One could create a file that contains only a single ASCII space (ASCII is UTF-8-compatible and has no BOM) and then append to it (replace $str with gc $src if the source is a file):

" "    |  out-file  -encoding ASCII  -noNewline  $dest
$str  |  out-file  -encoding UTF8   -append     $dest

as one-liner

replace $dest and $str according to your use case:

$_ofdst = $dest ; " " | out-file -encoding ASCII -noNewline $_ofdst ; $str | out-file -encoding UTF8 -append $_ofdst

as simple function

function Out-File-UTF8-noBOM { param( $str, $dest )
  " "    |  out-file  -encoding ASCII  -noNewline  $dest
  $str  |  out-file  -encoding UTF8   -append     $dest
}

using it with a source file:

Out-File-UTF8-noBOM  (gc $src)  $dest

using it with a string:

Out-File-UTF8-noBOM  $str  $dest

optionally: continue appending with Out-File: "more foo bar" | Out-File -encoding UTF8 -append $dest


JensG

Old question, new answer:

While the "old" PowerShell writes a BOM, the new platform-agnostic variant behaves differently: the default is "no BOM", and it can be configured via a switch:

-Encoding

Specifies the type of encoding for the target file. The default value is utf8NoBOM. The acceptable values for this parameter are as follows:

ascii: Uses the encoding for the ASCII (7-bit) character set.
bigendianunicode: Encodes in UTF-16 format using the big-endian byte order.
oem: Uses the default encoding for MS-DOS and console programs.
unicode: Encodes in UTF-16 format using the little-endian byte order.
utf7: Encodes in UTF-7 format.
utf8: Encodes in UTF-8 format.
utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
utf32: Encodes in UTF-32 format.

Source: https://docs.microsoft.com/de-de/powershell/module/Microsoft.PowerShell.Utility/Out-File?view=powershell-7 Emphasis mine


Zombo

For PowerShell 5.1, enable this setting:

Control Panel, Region, Administrative, Change system locale, Use Unicode UTF-8 for worldwide language support

Then enter this into PowerShell:

$PSDefaultParameterValues['*:Encoding'] = 'Default'

Alternatively, you can upgrade to PowerShell 6 or higher.

https://github.com/PowerShell/PowerShell


To spell it out: This is a system-wide setting that makes Windows PowerShell default to BOM-less UTF-8 across all cmdlets, which may or may not be desired, not least because the feature is still in beta (as of this writing) and can break legacy console applications - see this answer for background information.
Jaume Suñer Mut

Change multiple files by extension to UTF-8 without BOM:

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in ls -recurse -filter "*.java") {
    $MyFile = Get-Content $i.fullname 
    [System.IO.File]::WriteAllLines($i.fullname, $MyFile, $Utf8NoBomEncoding)
}

frank tan
    [System.IO.FileInfo] $file = Get-Item -Path $FilePath 
    $sequenceBOM = New-Object System.Byte[] 3 
    $reader = $file.OpenRead() 
    $bytesRead = $reader.Read($sequenceBOM, 0, 3) 
    $reader.Dispose() 
    # A file encoded as UTF-8 with BOM starts with the following three bytes. Hex: 0xEF 0xBB 0xBF, Decimal: 239 187 191
    if ($bytesRead -eq 3 -and $sequenceBOM[0] -eq 239 -and $sequenceBOM[1] -eq 187 -and $sequenceBOM[2] -eq 191)
    {
        $utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
        # Use the resolved full path: .NET's working directory can differ from PowerShell's.
        [System.IO.File]::WriteAllLines($file.FullName, (Get-Content $file.FullName), $utf8NoBomEncoding)
        Write-Host "Removed UTF-8 BOM successfully"
    }
    Else
    {
        Write-Warning "Not a UTF-8 file with BOM"
    }
    }  

Source: How to remove UTF8 Byte Order Mark (BOM) from a file using PowerShell


SATO Yusuke

If you want to use [System.IO.File]::WriteAllLines(), you should cast second parameter to String[] (if the type of $MyFile is Object[]), and also specify absolute path with $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), like:

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Set-Variable MyFile
[System.IO.File]::WriteAllLines($ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), [String[]]$MyFile, $Utf8NoBomEncoding)

If you want to use [System.IO.File]::WriteAllText(), sometimes you should pipe the second parameter into | Out-String | to add CRLFs to the end of each line explicitly (especially when you use them with ConvertTo-Csv):

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | Set-Variable tmp
[System.IO.File]::WriteAllText("/absolute/path/to/foobar.csv", $tmp, $Utf8NoBomEncoding)

Or you can use [Text.Encoding]::UTF8.GetBytes() with Set-Content -Encoding Byte:

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | % { [Text.Encoding]::UTF8.GetBytes($_) } | Set-Content -Encoding Byte -Path "/absolute/path/to/foobar.csv"

see: How to write result of ConvertTo-Csv to a file in UTF-8 without BOM


Good pointers; suggestions: the simpler alternative to $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath) is Convert-Path $MyPath; if you want to ensure a trailing CRLF, simply use [System.IO.File]::WriteAllLines() even with a single input string (no need for Out-String).
Nader Gharibian Fard

I had the same error in PowerShell and fixed it with this setting:

$PSDefaultParameterValues['*:Encoding'] = 'utf8'

Tanmay Sarin

I used this method to edit a UTF-8 (no BOM) file, and it generated a file with the correct encoding:

$fileD = "file.xml"
(Get-Content $fileD) | ForEach-Object { $_ -replace 'replace text',"new text" } | out-file "file.xml" -encoding ASCII

I was skeptical of this method at first, but it surprised me and worked!

Tested with powershell version 5.1


Pravanjan Hota

I would say to use just the Set-Content command, nothing else needed.

The PowerShell version on my system is:

PS C:\Users\XXXXX> $PSVersionTable.PSVersion | fl


Major         : 5
Minor         : 1
Build         : 19041
Revision      : 1682
MajorRevision : 0
MinorRevision : 1682

PS C:\Users\XXXXX>

So you would need something like following.

PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt | Set-Content .\Downloads\anotherfile.txt
PS C:\Users\XXXXX> Get-Content .\Downloads\anotherfile.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX>

Now when we check the file, as per the screenshot, it is UTF-8.


Erik Anderson

One technique I utilize is to redirect output to an ASCII file using the Out-File cmdlet.

For example, I often run SQL scripts that create another SQL script to execute in Oracle. With simple redirection (">"), the output will be in UTF-16 which is not recognized by SQLPlus. To work around this:

sqlplus -s / as sysdba "@create_sql_script.sql" |
Out-File -FilePath new_script.sql -Encoding ASCII -Force

The generated script can then be executed via another SQLPlus session without any Unicode worries:

sqlplus / as sysdba "@new_script.sql" |
tee new_script.log

Update: As others have pointed out, this will drop non-ASCII characters. Since the user asked for a way to "force" conversion, I assume they do not care about that, as perhaps their data does not contain such characters.

If you care about the preservation of non-ASCII characters, this is not the answer for you.


Yes, -Encoding ASCII avoids the BOM problem, but you obviously only get support for 7-bit ASCII characters. Given that ASCII is a subset of UTF-8, the resulting file is technically also a valid UTF-8 file, but all non-ASCII characters in your input will be converted to literal ? characters.
This answer needs more votes. The sqlplus incompatibility with BOM is a cause of many headaches.
@AmitNaidu No, this is the wrong answer, because it won't work if the text has any non-ASCII characters: any accents, umlauts, oriental/Cyrillic, etc.
@JoelCoehoorn This is a correct answer according to what the user asked. Since the user asked for a way to "force", they're not expecting any issues or don't care probably because the source doesn't use any non-ASCII characters. For those who do care about the preservation of those characters, this will not work.
Robin Wang

You could use the following to get UTF-8 without a BOM:

$MyFile | Out-File -Encoding ASCII

No, it will convert the output to current ANSI codepage (cp1251 or cp1252, for example). It is not UTF-8 at all!
Thanks Robin. This may not have worked for writing a UTF-8 file without the BOM but the -Encoding ASCII option removed the BOM. That way I could generate a bat file for gvim. The .bat file was tripping up on the BOM.
@ForNeVeR: You're correct that encoding ASCII is not UTF-8, but it's also not the current ANSI codepage - you're thinking of Default; ASCII truly is 7-bit ASCII encoding, with codepoints >= 128 getting converted to literal ? instances.
@ForNeVeR: You're probably thinking of "ANSI" or "extended ASCII". Try this to verify that -Encoding ASCII is indeed 7-bit ASCII only: 'äb' | out-file ($f = [IO.Path]::GetTempFilename()) -encoding ASCII; '?b' -eq $(Get-Content $f; Remove-Item $f) - the ä has been transliterated to a ?. By contrast, -Encoding Default ("ANSI") would correctly preserve it.
@rob This is the perfect answer for everybody who just doesn't need utf-8 or anything else that is different to ASCII and is not interested in understanding encodings and the purpose of unicode. You can use it as utf-8 because the equivalent utf-8 characters to all ASCII characters are identical (means converting an ASCII-file to an utf-8-file results in an identical file (if it gets no BOM)). For all who have non-ASCII characters in their text this answer is just false and misleading.
Krzysztof

This one works for me (use "Default" instead of "UTF8"):

$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "Default" $MyPath

The result is ASCII without BOM.


Per the Out-File documentation specifying the Default encoding will use the system's current ANSI code page, which is not UTF-8, as I required.
This does seem to work for me, at least for Export-CSV. If you open the resulting file in a proper editor, the file encoding is UTF-8 without BOM, and not Western Latin ISO 9 as I would have expected with ASCII
Many editors open the file as UTF-8 if they can't detect the encoding.