
Getting std::ifstream to handle LF, CR, and CRLF?

Specifically, I'm interested in istream& getline(istream& is, string& str). Is there an option to the ifstream constructor that tells it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.

Update: To clarify, I want to be able to write code that compiles almost anywhere and takes input from almost anywhere, including the rare files that have '\r' without '\n'. The goal is to minimize inconvenience for any users of the software.

It's easy to work around the issue, but I'm still curious as to the right way, in the standard, to flexibly handle all text file formats.

getline reads in a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n' that gets included into the string.

There are three types of line endings seen in text files: '\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' followed by '\n'.

The problem is that getline leaves the '\r' on the end of the string.

ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a line was read (it may be empty)
   // BUT, there might be a '\r' at the end now.
}

Edit Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.

I can remove it manually myself (see edit of this question), which is easy for the Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!

.. and that's not even considering Unicode :-)

.. maybe Boost has a nice way to consume one line at a time from any text-file type?

Edit I'm using this to handle the Windows files, but I still feel I shouldn't have to! And this won't work for the '\r'-only files.

if(!line.empty() && *line.rbegin() == '\r') {
    line.erase( line.length()-1, 1);
}
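
For anyone who wants to try this out, here is a minimal self-contained sketch of the workaround above. The helper name readLinesStrippingCR is made up for illustration, and it assumes C++11 (for back() and pop_back()). It handles LF and CRLF input, but a CR-only input still comes back as a single line, which is exactly the limitation described above.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library): read every line
// with std::getline and drop one trailing '\r', so that LF and CRLF inputs
// produce the same strings.
std::vector<std::string> readLinesStrippingCR(std::istream& in)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

Feeding it a std::istringstream holding "a\r\nb\r\n" gives two lines, "a" and "b". Feeding it "a\rb\r" gives one line that still contains an embedded '\r'.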
\n means a new line, however that is represented on the current OS. The library takes care of that. But for that to work, a program compiled on Windows should read text files from Windows, a program compiled on Unix should read text files from Unix, etc.
@George, even though I'm compiling on a Linux machine, sometimes I'm using text files that came originally from a Windows machine. I might release my software (a small tool for network analysis), and I want to be able to tell users that they can feed in almost any type of (ASCII-like) text file.
Note that if(f.good()) does not do what you seem to think it does.
Thanks @Neil, I fell for that even though I checked it all a few days ago! I fully understood it then. I think I allowed myself carelessly to assume that f.good() should be the opposite of f.fail().

21 revs, 3 users 99%

As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."

However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):

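// Headers needed by safeGetline and by the test program further down.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>

std::istream& safeGetline(std::istream& is, std::string& t)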
{
    t.clear();

    // The characters in the stream are read one-by-one using a std::streambuf.
    // That is faster than reading them one-by-one using the std::istream.
    // Code that uses streambuf this way must be guarded by a sentry object.
    // The sentry object performs various tasks,
    // such as thread synchronization and updating the stream state.

    std::istream::sentry se(is, true);
    std::streambuf* sb = is.rdbuf();

    for(;;) {
        int c = sb->sbumpc();
        switch (c) {
        case '\n':
            return is;
        case '\r':
            if(sb->sgetc() == '\n')
                sb->sbumpc();
            return is;
        case std::streambuf::traits_type::eof():
            // Also handle the case when the last line has no line ending
            if(t.empty())
                is.setstate(std::ios::eofbit);
            return is;
        default:
            t += (char)c;
        }
    }
}

And here is a test program:

int main()
{
    std::string path = ...  // insert path to test file here

    std::ifstream ifs(path.c_str());
    if(!ifs) {
        std::cout << "Failed to open the file." << std::endl;
        return EXIT_FAILURE;
    }

    int n = 0;
    std::string t;
    while(!safeGetline(ifs, t).eof())
        ++n;
    std::cout << "The file contains " << n << " lines." << std::endl;
    return EXIT_SUCCESS;
}

@Miek: I have updated the code following Bo Persson's suggestion stackoverflow.com/questions/9188126/… and ran some tests. Everything now works as it should.
@Thomas Weller: The constructor and destructor for the sentry are executed. These do things such as thread synchronization, skipping white space and updating the stream state.
In the EOF case, what is the purpose of checking that t is empty before setting the eofbit? Shouldn't that bit be set regardless of other characters having been read in?
@Yay295: The eof flag should be set, not when you reach the end of the last line, but when you attempt to read beyond the last line. The check makes sure that this happens when the last line has no EOL. (Try removing the check, and then run the test program on a text file where the last line has no EOL, and you will see.)
This also reads an empty last line, which is not the behavior of std::getline, which ignores an empty last line. I used the following code in the eof case to emulate the std::getline behavior: is.setstate(std::ios::eofbit); if (t.empty()) is.setstate(std::ios::badbit); return is;
Anonymous

The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:

#include <string>
#include <iostream>
using namespace std;

int main() {
    string line;
    while( getline( cin, line ) ) {
        cout << line << endl;
    }
}

Of course, if you are dealing with files from another platform, all bets are off.

As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return, you can examine the last character of the line string in the above code to see if it is \r and, if so, remove it before doing your application-specific processing.

For example, you could provide yourself with a getline style function that looks something like this (not tested, use of indexes, substr etc for pedagogical purposes only):

istream & safegetline( istream & is, string & line ) {
    string myline;
    if ( getline( is, myline ) ) {
       // drop the trailing '\r' left over from a CRLF line ending
       if ( myline.size() && myline[myline.size()-1] == '\r' ) {
           line = myline.substr( 0, myline.size() - 1 );
       }
       else {
           line = myline;
       }
    }
    return is;
}

The question is about how to deal with files from another platform.
@Neil, this answer isn't sufficient yet. If I just wanted to handle CRLFs, I wouldn't have come to StackOverflow. The real challenge is to handle the files which only have '\r'. They're pretty rare nowadays, now that MacOS has moved closer to Unix, but I don't want to assume they will never be fed to my software.
@Aaron well, if you want to be able to handle ANYTHING you have to write your own code to do it.
I made clear in my question from the start that it is easy to work around this, implying that I am willing and able to do so. I asked about this because it seems to be such a common question, and there are a variety of text-file formats. I assumed/hoped that the C++ standards committee had built this in. This was my question.
@Neil, I think there's another issue I/we have forgotten. But first, I accept that it's practical for me to identify a small number of formats to be supported. Therefore, I want code that will compile on Windows and Linux and which will work with either format. Your safegetline is an important part of a solution. But if this program is being compiled on Windows, will I also need to open the file in binary format? Do Windows compilers (in text mode) allow '\n' to behave like '\r''\n'? ifstream f("f.txt", ios_base::binary | ios_base::in);
bouvierr

Are you reading the file in BINARY or in TEXT mode? In TEXT mode the carriage return/line feed pair, CRLF, is interpreted as the TEXT end of line, or end-of-line character, but in BINARY you fetch only ONE byte at a time, which means that either character MUST be ignored and left in the buffer to be fetched as another byte!

Carriage return means, on a typewriter, that the carriage, where the printing arm lies, has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. Line feed then means that the paper roll is rotated up a little so the paper is in position to begin another line of typing. As far as I remember, one of the low codes in ASCII means move to the right one character without typing (the dead char), and of course \b means backspace: move the carriage one character back. That way you can add special effects, like underlining (type underscore), strikethrough (type minus), approximating different accents, or cancelling out (type X), without needing an extended keyboard, just by adjusting the position of the carriage along the line before entering the line feed. So you can use byte-sized ASCII voltages to control a typewriter automatically, without a computer in between.

When the automatic typewriter was introduced, AUTOMATIC meant that once you reach the farthest edge of the paper, the carriage is returned to the left AND the line feed is applied; that is, the carriage is assumed to be returned automatically as the roll moves up! So you do not need both control characters, only one: the \n, new line, or line feed.

This has nothing to do with programming, but ASCII is older and HEY! it looks like some people were not thinking when they began doing text things! The UNIX platform assumes an electrical automatic typewriter; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character, 0x07 if I remember well... Some forgotten texts must have been originally captured with control characters for electrically controlled typewriters, and that perpetuated the model...

Actually, the correct variation would be to delimit lines on \r, the carriage return, and treat a following \n, the line feed, as optional, hence:

ifstream is;
is.open("",ios::binary);
...
is.getline(buffer, bufsize, '\r');

// skip a '\n' that directly follows the '\r'; any other byte stays in the buffer
if (is.peek() == '\n') is.get();
...

would be the most correct way to handle all types of files. Note, however, that on Windows \n in TEXT mode is actually stored as the byte pair 0x0d 0x0a, and 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY, so \n and \r\n are equivalent... or should be. This is a very basic industry confusion, typical industry inertia: the convention is to speak of CRLF on ALL platforms, and then to fall into different binary interpretations. Strictly speaking, files that use ONLY 0x0d (carriage return) as the line terminator are malformed in TEXT mode (on a typewriter: just return the carriage and strike through everything...) and are a non-line-oriented binary format (either \n or \r\n meaning line-oriented), so you are not supposed to read them as text! The code ought to fail, maybe with some user message. This depends not only on the OS but also on the C library implementation, adding to the confusion and possible variations (particularly for transparent UNICODE translation layers, which add another point of articulation for confusing variations).

The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \n characters after \r (automatic typewriter text). It also assumes BINARY mode, where the C library is forced to ignore text interpretations (locale) and hand over the raw bytes. There should be no difference in the actual text characters between both modes, only in the control characters, so generally speaking reading in BINARY mode is better than in TEXT mode. This solution is efficient for typical Windows text files read in BINARY mode, independently of C library variations, and inefficient for other platforms' text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer: make a test for \r vs \r\n line controls however you like, then store the best getline user code in the pointer and invoke it through that.
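
The "make a test" step is left open above, so here is one possible sketch of it, not a definitive implementation. The enum LineEnding and the function detectLineEnding are hypothetical names, and the sketch assumes that a sample of the file (for example its first few kilobytes, read in binary mode) is already in a std::string. It only classifies the first line terminator it finds; a caller could then pick a reader function based on the result.

```cpp
#include <string>

// Hypothetical classification of the first line terminator in a sample.
enum class LineEnding { LF, CR, CRLF, None };

// Hypothetical helper: look at the first '\r' or '\n' in `sample` and
// report which convention it belongs to. If the sample happens to end
// right after a '\r', this reports CR even though the next byte in the
// file might be '\n', so the sample should end on a safe boundary.
LineEnding detectLineEnding(const std::string& sample)
{
    const std::string::size_type pos = sample.find_first_of("\r\n");
    if (pos == std::string::npos)
        return LineEnding::None;
    if (sample[pos] == '\n')
        return LineEnding::LF;
    if (pos + 1 < sample.size() && sample[pos + 1] == '\n')
        return LineEnding::CRLF;
    return LineEnding::CR;
}
```

Files with mixed line endings would defeat this kind of up-front detection, which is one reason a per-character approach may still be the safer choice.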

Incidentally I remember I found some \r\r\n text files too... which translates into double line text just as is still required by some printed text consumers.


+1 for the "ios::binary" - sometimes, you actually want to read the file as it is (e.g. for calculating a checksum etc.) without the runtime changing the line endings.
user2061057

One solution would be to first search for all line endings and replace them with '\n', just as Git, for example, does by default.
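
A minimal sketch of that search-and-replace step, assuming the whole text has already been read into a std::string (in binary mode, so that the runtime has not translated anything first). The function name normalizeLineEndings is made up for illustration.

```cpp
#include <string>

// Hypothetical helper: return a copy of `text` in which every "\r\n" pair
// and every lone '\r' has been replaced by a single '\n'.
std::string normalizeLineEndings(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        if (text[i] == '\r') {
            out += '\n';
            // Skip the '\n' of a "\r\n" pair so it is not emitted twice.
            if (i + 1 < text.size() && text[i + 1] == '\n')
                ++i;
        } else {
            out += text[i];
        }
    }
    return out;
}
```

The normalized string can then be wrapped in a std::istringstream and read with plain std::getline. The cost is holding a second copy of the text in memory.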


Anonymous

Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check whether the last character of the line, line[line.length() - 1], is '\r' (after making sure the line is not empty). On Linux, this check is superfluous for native files, since their lines end with a bare '\n', meaning you'll lose a fair bit of time if it runs in a loop. On Windows, it is also superfluous for native files read in text mode. However, what about classic Mac files, whose lines end in '\r'? std::getline would not work for those files on Linux or Windows: it splits on '\n', which both '\n' and '\r\n' end with, but a '\r'-only file contains no '\n' at all, so the whole file is read as one line. Obviously, a program that has to work with those files would not work well. Of course, there are also the numerous EBCDIC systems, something that most libraries won't dare tackle.

Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows, as old-style Mac line endings shouldn't be around for much longer, check for '\n' only and remove the trailing '\r' character.


Martin Thümmel

If it is known how many items/numbers each line has, one could read one line with e.g. 4 numbers as

string num;
is >> num >> num >> num >> num;

This also works with other line endings.
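
The reason this works is that operator>> skips leading whitespace, and both '\r' and '\n' count as whitespace in the default "C" locale. Here is a small sketch that shows the effect on an in-memory string. The helper name readTokens is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: extract whitespace-separated tokens with operator>>.
// '\r' and '\n' are both skipped as whitespace, so the line-ending
// convention of the input does not change the tokens that come out.
std::vector<std::string> readTokens(const std::string& text)
{
    std::istringstream is(text);
    std::vector<std::string> tokens;
    std::string tok;
    while (is >> tok)
        tokens.push_back(tok);
    return tokens;
}
```

Note that line boundaries are lost this way, which is why it only helps when the number of items per line is known in advance.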


Gergely Nagy

Unfortunately, the accepted solution does not behave exactly like std::getline(). To obtain that behavior (according to my tests), the following change is necessary:

std::istream& safeGetline(std::istream& is, std::string& t)
{
    t.clear();

    // The characters in the stream are read one-by-one using a std::streambuf.
    // That is faster than reading them one-by-one using the std::istream.
    // Code that uses streambuf this way must be guarded by a sentry object.
    // The sentry object performs various tasks,
    // such as thread synchronization and updating the stream state.

    std::istream::sentry se(is, true);
    std::streambuf* sb = is.rdbuf();

    for(;;) {
        int c = sb->sbumpc();
        switch (c) {
        case '\n':
            return is;
        case '\r':
            if(sb->sgetc() == '\n')
                sb->sbumpc();
            return is;
        case std::streambuf::traits_type::eof():
            is.setstate(std::ios::eofbit);       //
            if(t.empty())                        // <== change here
                is.setstate(std::ios::failbit);  // 
            return is;
        default:
            t += (char)c;
        }
    }
}

According to https://en.cppreference.com/w/cpp/string/basic_string/getline:

Extracts characters from input and appends them to str until one of the following occurs (checked in the order listed):

a) end-of-file condition on input, in which case getline sets eofbit.
b) the next available input character is delim, as tested by Traits::eq(c, delim), in which case the delimiter character is extracted from input, but is not appended to str.
c) str.max_size() characters have been stored, in which case getline sets failbit and returns.

If no characters were extracted for whatever reason (not even the discarded delimiter), getline sets failbit and returns.
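
To see those rules in action, here is a small sketch that calls std::getline once on an in-memory string and reports which state bits are set afterwards. The helper name stateAfterGetline and its return format are made up for illustration.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: call std::getline once on `text` and describe the
// resulting stream state as "good", "eof", "fail" or "eof fail".
std::string stateAfterGetline(const std::string& text)
{
    std::istringstream is(text);
    std::string line;
    std::getline(is, line);

    std::string state;
    if (is.eof())
        state += "eof ";
    if (is.fail())
        state += "fail ";
    if (state.empty())
        return "good";
    state.pop_back(); // drop the trailing space
    return state;
}
```

With "abc\n" the delimiter is found, so no bit is set. With "abc" the end of input is hit after some characters were extracted, so only eofbit is set. With an empty string nothing is extracted, so both eofbit and failbit are set, which is the case the changed lines above reproduce.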