
Understanding PEFF Format for Protein Sequence Databases
Explore the PEFF format, a unified format for protein sequence databases used by sequence search engines. Learn how PEFF enables consistent extraction, display, and processing of information such as post-translational modifications, mutations, and other processing events. Discover the status of PEFF after 10 years in the making, mapping sequences to PEFF, and the new tools being developed for indexing databases and mapping peptides.
Presentation Transcript
Indexing FASTA and PEFF files Luis Mendoza

What is PEFF?
PEFF = PSI Extended FASTA Format
- A unified format for protein sequence databases, to be used by sequence search engines and other associated tools.
- Enables consistent extraction, display and processing of information such as post-translational modifications, mutations and other processing events.
- Plain text, largely FASTA-like for backwards compatibility.

What does PEFF look like?

Standard FASTA (unique ID, then description / more info, then sequence):

    >sp|Q5EE01|CENPW_HUMAN Centromere protein W OS=Homo sapiens GN=CENPW PE=1 SV=1
    MALSTIVSQRKQIKRKAPRGFLKRVFKRKKPQLRLEKSGDLLVH
    LNCLLFVHRLAEESRTNACASKCRVINKEHVLAAAKVILKKSRG
    >sp|Q53EZ4|CEP55_HUMAN ...
    ...

PEFF (unique ID, then \Keywords, then sequence):

    >nxp:NX_Q5EE01-1 \DbUniqueId=NX_Q5EE01-1 \PName=Centromere protein W isoform Iso 1 \GName=CENPW \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=88 \SV=61 \EV=265 \PE=1 \VariantSimple=(4|L)(6|M)(6|V)(8|P)(8|F)(11|R)(19|H)(19|C)(20|D)(24|Q)(28|L)(28|P)(31|R)(32|*)(40|N)(41|F)(45|V)(47|F)(52|R)(53|*)(53|Q)(57|D)(59|G)(63|F)(64|V)(12|H)(26|C)(62|T)(63|S)(74|R)(78|T)(80|M)(86|I)(86|G) \Processed=(1|88|mature protein)
    MALSTIVSQRKQIKRKAPRGFLKRVFKRKKPQLRLEKSGDLLVH
    LNCLLFVHRLAEESRTNACASKCRVINKEHVLAAAKVILKKSRG
    >nxp:NX_Q5EE01-1 \DbUniqueId=NX_Q5EE01-1 ...
    ...
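
To make the header layout above concrete, here is a minimal Python sketch that splits a PEFF description line into its `\Key=Value` pairs and decodes the `\VariantSimple` list. The function names and the regex-based splitting are assumptions made for illustration; they are not taken from the PEFF specification or from any tool mentioned in the talk, and the sketch would mis-split a value that itself contained ` \Word=`.

```python
import re

def parse_peff_header(header: str) -> dict:
    """Split a PEFF description line into prefix, accession and \\Key=Value pairs.

    Hypothetical helper: a regex-based sketch, not the specification's parser.
    """
    if not header.startswith(">"):
        raise ValueError("header must start with '>'")
    ident, _, rest = header[1:].partition(" ")
    prefix, _, accession = ident.partition(":")
    fields = {"prefix": prefix, "accession": accession}
    # Each keyword starts with a backslash; its value runs to the next " \Key=".
    for key, value in re.findall(r"\\(\w+)=(.*?)(?=\s+\\\w+=|$)", rest):
        fields[key] = value.strip()
    return fields

def parse_variant_simple(value: str) -> list[tuple[int, str]]:
    """Turn '(4|L)(6|M)' into [(4, 'L'), (6, 'M')]."""
    return [(int(pos), aa) for pos, aa in re.findall(r"\((\d+)\|([^)|]+)\)", value)]

header = (r">nxp:NX_Q5EE01-1 \DbUniqueId=NX_Q5EE01-1 "
          r"\PName=Centromere protein W isoform Iso 1 \GName=CENPW "
          r"\NcbiTaxId=9606 \Length=88 \VariantSimple=(4|L)(6|M)(6|V)(8|P) "
          r"\Processed=(1|88|mature protein)")
fields = parse_peff_header(header)
print(fields["PName"])
print(parse_variant_simple(fields["VariantSimple"]))
```

The example header is a shortened copy of the neXtProt entry shown on the slide, with the variant list truncated to its first four items.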

PEFF Keywords
- Single amino acid substitutions
- Insertions
- Deletions
- Amino acid mass modifications

PEFF Status
- 10 years in the making
- The specification is nearly complete and almost ready to enter the PSI Document Process for formal review.
- Can export from neXtProt, or generate your own!
- Format validator at PeptideAtlas
- Search with COMET
- http://www.psidev.info/peff

Mapping Sequences to PEFF
- The traditional method (RefreshParser) is not suited to this problem.
- Also need a tool that can:
  - determine whether a peptide sequence maps uniquely to a protein (proteotypic)
  - do fuzzy mapping, where a portion of the peptide sequence is unknown
- New tools to index databases and map peptides are in development.

Indexing Basics :: FASTA

    >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens GN=YWHAB PE=1 SV=3
    MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRS
    SWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKV
    FYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFS
    VFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQG
    DEGDAGEGEN
    ...

I. Pick segment size (s=5)
II. For each protein:
    1. Generate accession alias: 1::sp|P31946|1433B_HUMAN
    2. Record position of each s-length segment: MTMDK:1,1 TMDKS:1,2 MDKSE:1,3 DKSEL:1,4
III. Repeat
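
The three steps above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code of indexPEFF.pl: the function names are hypothetical, aliases are assigned in file order as on this slide, and the sketch does not skip non-standard residues.

```python
from collections import defaultdict

def read_fasta(lines):
    """Yield (accession, sequence) pairs from FASTA/PEFF-style lines."""
    acc, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if acc is not None:
                yield acc, "".join(chunks)
            acc, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if acc is not None:
        yield acc, "".join(chunks)

def build_segment_index(proteins, s=5, i_to_l=True):
    """Return (aliases, index): alias -> accession, segment -> [(alias, pos)]."""
    aliases, index = {}, defaultdict(list)
    for alias, (acc, seq) in enumerate(proteins, start=1):
        aliases[alias] = acc
        if i_to_l:
            seq = seq.replace("I", "L")  # treat I and L as the same residue
        for start in range(len(seq) - s + 1):
            index[seq[start:start + s]].append((alias, start + 1))  # 1-based
    return aliases, index

fasta = [
    ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha",
    "MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRS",
]
aliases, index = build_segment_index(read_fasta(fasta))
for seg in ("MTMDK", "TMDKS", "MDKSE", "DKSEL"):
    print(seg + ":" + ":".join(f"{a},{p}" for a, p in index[seg]))
```

The loop prints the first four segments in the same `SEGMENT:alias,position` notation used on the slide.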

PEFF Extensions

    >nxp:NX_Q5EE01-2 \DbUniqueId=NX_Q5EE01-2 \PName=Centromere protein W isoform Iso 2 \GName=CENPW \NcbiTaxId=9606 \TaxName=Homo Sapiens \Length=103 \SV=26 \EV=265 \PE=1 \VariantSimple=(4|L)(6|M)(6|V)(8|P)(8|F)(11|R)(19|H)(19|C)(20|D)(24|Q)(28|L)(28|P)(31|R)(32|*)(40|N)(41|F)(47|S)(49|W)(49|L)(55|M)(56|E)(60|V)(62|F)(67|R)(68|*)(68|Q)(72|D)(74|G)(78|F)(79|V)(12|H)(26|C)(77|T)(78|S)(89|R)(93|T)(95|M)(101|I)(101|G) \Processed=(1|103|mature protein)
    MALSTIVSQRKQIKRKAPRGFLKRVFKRKKPQLRLEKSGDLLVRFHPFSGWE
    WGTGEVHLNCLLFVHRLAEESRTNACASKCRVINKEHVLAAAKVILKKSRG

Generate all possible combinations, e.g. position 4:

    \VariantSimple=(4|L)(6|M)(6|V)(8|P)(8|F)(11|R)...

    M A L S T I V S Q R K Q . . .
          L   M   P     R
              V   F

    STIVS STMVS STVVS LTIVS LTMVS LTVVS
    STIVP STMVP STVVP LTIVP LTMVP LTVVP
    STIVF STMVF STVVF LTIVF LTMVF LTVVF
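
The enumeration for the segment starting at position 4 can be reproduced with a Cartesian product over the residue choices at each position. The sketch below is an assumption about how such an expansion might be written, with a hypothetical function name; it ignores stop variants (`*`) and does not apply the I->L substitution.

```python
from itertools import product

def variant_segments(seq, variants, start, s=5):
    """All s-length segments beginning at 1-based `start`, with every
    combination of the single-residue variants that fall inside the window."""
    choices = []
    for pos in range(start, start + s):
        # Reference residue first, then any substitutions listed at this position.
        alts = [aa for p, aa in variants if p == pos and aa != "*"]
        choices.append([seq[pos - 1]] + alts)
    return ["".join(combo) for combo in product(*choices)]

seq = "MALSTIVSQRKQ"
variants = [(4, "L"), (6, "M"), (6, "V"), (8, "P"), (8, "F"), (11, "R")]
segments = variant_segments(seq, variants, start=4)
print(len(segments), segments)
```

With two choices at position 4 and three each at positions 6 and 8, the window yields 2 x 3 x 3 = 18 segments, the same 18 listed on the slide.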

Segment Index Examples

    ...
    MTMDF:39998,110:16987,256:9989,409:8055,409:14974,409:9966,32
    MTMDG:21245,214:13297,551:6850,771:6946,765:8252,691:8222,691:6761,771:7289,742:15577,216:19017,216:35319,128:23400,175:423,2611:301,247:322,247:3444,574:3912,512:3971,574:4221,574:3967,574:30954,1:34487,1
    MTMDH:27409,118:27587,122:39998,110:1665,1345:11991,577:12010,577:13725,524:21476,137
    MTMDK:7427,745:6034,827:6281,811:6489,797:7277,758:4127,191:5082,114:3545,419:4188,338:22806,160:25961,160:25696,160:25125,121:28722,121:28431,121:31188,28:34399,28:34078,28:19330,250:3857,177:15577,216:19017,216:35319,128:23400,175:423,2611:32146,1:16085,424:25829,66:8696,23:13,1180:270,2935:288,2935:279,2896:299,2935:24862,125
    MTMDL:12635,443:31377,61:31820,61:37694,52:37694,113:40576,1:40576,62:36955,52:36955,113:10307,481:29218,272:30942,243:4939,562:14147,184:355,2484:330,2484:4315,630:5236,562:8580,183:9242,183:14962,341:20665,223:8448,12:818,1774:1168,1774:844,1755:768,1774:9015,688:19571,269:24963,269:21512,335:21868,329:21822,330:21489,336:6624,340:12762,340:21476,137:28584,273
    ...

New Indexing Tool :: indexPEFF

    --------------------------------------------------------------------------
    Program: ./indexPEFF.pl
    Purpose: Generates protein sequence index file by use of segments, for use
             in mapping observed peptide sequences to all proteins. Works with
             any protein file in FASTA format, including PEFF.
    Usage:   ./indexPEFF.pl [options] <fasta_file>
    Generates: <fasta_file>.pep.idx
    Options:
      -s <length>  segment size, in number of aminoacids [default=5]
      -V           do not use PEFF variants [default:use them]
      -f           force overwriting of index file, if exists
      -I           do NOT convert I->L
      -A           do not generate all possible keys in index (not recommended for large files)
    For Developers:
      -D           print debug information
    --------------------------------------------------------------------------

Output Index File Structure:
- Header
- Segments Offset (meta-index)
- Protein Aliases
- Segment Index

Index File :: Header
- General information
- Segment size
- Variants and subs used

    # Index generated by ./indexPEFF.pl
    # Date=Fri Jan 12 19:50:51 2018
    # OriginalFile=nextprot_all_updatedTo1.0h.peff
    # AASub=I->L
    # KeyLength=5
    # PEFFVariants=VariantSimple
    # NumProteins=42164
    # NumSegments=2459702
    # BeginSegmentOffsets
    AA::0000965328
    ...

Index File :: Segments Offset
- In-file byte offsets to landmark positions
- Enables faster retrieval of segment entries
- Especially beneficial for single-peptide lookups

    # BeginSegmentOffsets
    AA::0000965328
    AC::0009652271
    AD::0012957829
    AE::0017719328
    ...
    YV::1837716365
    YW::1841061692
    YY::1841991570
    # BeginProteins
    39793::nxp:NX_A0A075B6H9-1
    ...

Index File :: Aliases
- Alias is equal to length-based rank
- Longer proteins generate more segments
- Saves space, memory

    # BeginProteins
    39793::nxp:NX_A0A075B6H9-1
    39680::nxp:NX_A0A075B6I0-1
    39742::nxp:NX_A0A075B6I1-1
    39920::nxp:NX_A0A075B6I4-1
    40029::nxp:NX_A0A075B6I9-1
    ...
    32311::nxp:NX_W5XKT8-2
    34284::nxp:NX_W5XKT8-3
    # BeginIndex
    AAAAA: ...
    ...
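
A length-based rank can be computed with a single sort. The sketch below assumes that rank 1 goes to the longest protein, so that the proteins contributing the most index entries get the shortest numeric aliases. That reading fits the space-saving rationale on the slide but is not stated there explicitly, and the function name is hypothetical.

```python
def length_rank_aliases(proteins):
    """Map rank -> accession, with rank 1 assigned to the longest sequence.

    `proteins` is an iterable of (accession, sequence) pairs. Ties keep
    their input order because Python's sort is stable.
    """
    ranked = sorted(proteins, key=lambda item: len(item[1]), reverse=True)
    return {rank: acc for rank, (acc, _seq) in enumerate(ranked, start=1)}

toy = [("short", "MKT"), ("long", "MKTAYIAKQR"), ("mid", "MKTAYI")]
for rank, acc in length_rank_aliases(toy).items():
    print(f"{rank}::{acc}")
```

The accessions in `toy` are placeholders, and the output uses the same `alias::accession` layout as the index file.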

Index File :: The Index!
- Alphabetical!
- Full keys (even when empty) enable faster lookup

    AAAAA:...
    ...
    MTMWS:19928,120:19928,182:12016,1:13295,1:33371,69
    MTMWT:18258,3
    MTMWV:4261,517:11966,55:24847,187:16415,178:16645,178:4873,178:1845,894:37535,106:33301,175:13785,417:18836,23:18427,23:16522,23:33533,92:34564,92
    MTMWW:
    MTMWY:442,1828:447,1820
    MTMYA:38090,31:36746,31:12457,422:26036,142:21738,164:21967,160:25335,276:26650,255:28748,227
    ...
    YYYYY:...
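
Putting the four parts of the file together, a reader might parse the preamble once and then use the two-letter offsets to jump close to a segment before scanning. The sketch below is written against the layout shown on these slides only. It assumes the offsets are absolute byte positions of the first key with that prefix, and the function names and the toy file are invented for illustration.

```python
import io

def read_index_preamble(fh):
    """Parse header fields, segment offsets and protein aliases.

    `fh` is a binary file object; reading stops at the '# BeginIndex' marker.
    """
    header, offsets, aliases = {}, {}, {}
    section = "header"
    for raw in iter(fh.readline, b""):
        line = raw.decode("ascii").strip()
        if line == "# BeginSegmentOffsets":
            section = "offsets"
        elif line == "# BeginProteins":
            section = "aliases"
        elif line == "# BeginIndex":
            break
        elif section == "header" and line.startswith("# ") and "=" in line:
            key, _, value = line[2:].partition("=")
            header[key] = value
        elif section == "offsets" and "::" in line:
            prefix, _, off = line.partition("::")
            offsets[prefix] = int(off)
        elif section == "aliases" and "::" in line:
            alias, _, acc = line.partition("::")
            aliases[int(alias)] = acc
    return header, offsets, aliases

def lookup_segment(fh, offsets, segment):
    """Seek to the landmark for the segment's 2-letter prefix, then scan."""
    fh.seek(offsets[segment[:2]])
    for raw in iter(fh.readline, b""):
        key, _, entries = raw.decode("ascii").strip().partition(":")
        if key == segment:
            return [tuple(map(int, e.split(","))) for e in entries.split(":") if e]
        if key > segment:  # keys are alphabetical, so the segment is absent
            break
    return []

# Toy index file (hypothetical content) with correctly computed offsets.
body = b"AAAAA:1,1\nAAAAC:\nACDEF:2,7:1,3\nACDEG:2,8\n"
preamble = ("# KeyLength=5\n# BeginSegmentOffsets\nAA::{:010d}\nAC::{:010d}\n"
            "# BeginProteins\n1::protA\n2::protB\n# BeginIndex\n")
base = len(preamble.format(0, 0))  # fixed-width offsets keep this length stable
toy = preamble.format(base, base + body.index(b"ACDEF")).encode("ascii") + body
fh = io.BytesIO(toy)
header, offsets, aliases = read_index_preamble(fh)
print(header, aliases, lookup_segment(fh, offsets, "ACDEF"))
```

The zero-padded, fixed-width offsets are what make it possible to write the offsets table before the index it points into, since its byte length does not depend on the values.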

Considerations and Limitations
- Must pre-generate index before mapping peptide sequences
  - Can be time-consuming (several minutes / file)
  - Can take up significant system memory
- Must create one per database/FASTA file and options
  - Variants / no variants
  - Include decoys!
- I/L cannot be changed once index built
  - Might consider lookup option via more than one index
- Skips B, J, O, U, X, Z (and any other non-AA chars)
- Variant position(s) not included yet.
- Only capturing VariantSimple. Others pending
- ? Is it overkill to include all possible variants in each segment?
Indexing :: Performance (updated) File Variants? #entries Orig. Size Index Size #Files Elapsed Time Human SwissProt + decoys 40,408 24 Mb 179 Mb 214 Mb 1 37.64 N (pepx 7) (previous) 718 Mb 1.2 Gb 2 8 44.11 18:53.62 neXtProt Human 42,196 125 Mb 1.5 Gb 1 12:29:85 Y (pepx 7)* 3.7* Gb >40k! 4:07:91* neXtProt Human 42,196 125 Mb 192 Mb 1 36.48 N (pepx 7) 464 Mb 2 31:35 17

Sequence Mapping Using Segment Indices

Peptide: GARRYLLKEKEYLLME / GARRYLIKEKEYLIME

    Portion  #    Example                                                      Shift
    YLLME    220  7693,634:11657,459:19587,265:15246,330:30506,3:22806,288     -5 = 283
    LKEKE    766  6328,82:15422,385:17310,385:21220,306:22806,283              -5 = 278
    ARRYL    153  19062,4:20205,4:16614,507:16703,504:33400,21:22806,278       -1 = 277
    GARRY    84   62,885:7418,84:8291,84:247,2339:245,2344:22806,277           match!

I. Read segment size (s=5) and AA subs (I->L) from index
II. Split peptide into segments; start from the end, and include beginning
III. Extract segment entries from index
IV. Match based on protein and position (use appropriate shift!)
V. Look up protein(s) from alias list (22806::nxp:NX_P04637-1)
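
Steps II to IV can be sketched as a set intersection: each segment hit is shifted back to the peptide start it implies, and only starts supported by every segment survive. This is an illustrative reconstruction rather than the code of mapPeptides.pl. The function names are hypothetical, and the toy index holds only a few of the entries shown in the table above.

```python
def split_segments(peptide, s=5):
    """Cover the peptide with s-length segments, working back from the end
    and always including the first segment. Returns (offset, segment) pairs
    with 0-based offsets into the peptide."""
    if len(peptide) < s:
        raise ValueError("peptide shorter than segment size")
    offsets = list(range(len(peptide) - s, 0, -s))
    offsets.append(0)
    return [(off, peptide[off:off + s]) for off in offsets]

def map_peptide(peptide, index, s=5, i_to_l=True):
    """Return sorted (alias, start) pairs on which all segments agree."""
    if i_to_l:
        peptide = peptide.replace("I", "L")
    candidates = None
    for off, seg in split_segments(peptide, s):
        # Shift every hit back to the peptide start position it implies.
        starts = {(alias, pos - off) for alias, pos in index.get(seg, [])}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)

index = {
    "YLLME": [(7693, 634), (22806, 288)],
    "LKEKE": [(6328, 82), (22806, 283)],
    "ARRYL": [(19062, 4), (22806, 278)],
    "GARRY": [(62, 885), (7418, 84), (22806, 277)],
}
print(split_segments("GARRYLLKEKEYLLME"))
print(map_peptide("GARRYLIKEKEYLIME", index))
```

The slide applies the shifts as a chain (288 - 5 = 283, 283 - 5 = 278, 278 - 1 = 277), whereas the sketch shifts each hit directly to the peptide start. Both arrive at protein alias 22806, position 277.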

Potential Sequence Mis-mapping

Peptide: GARRYAWAY

    Portion  #   Example                                              Shift
    YAWAY    12  8136,381:22806,273:25961,273:25696,273:              22806,273 -4 = 269 *
    GARRY    84  22806,172:22806,189:22806,277:22806,293:62,885:      22806,277 NO match

* pepx returns a potential match in this case; an extra step is needed to verify reported matches against the protein entry in the database.
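
The extra verification step mentioned in the footnote amounts to comparing the peptide with the protein sequence at the reported position. A minimal sketch follows, with a hypothetical function name and a made-up toy sequence. It assumes 1-based positions and the same I->L convention as the index.

```python
def verify_match(protein_seq, peptide, start, i_to_l=True):
    """True if `peptide` occurs in `protein_seq` at 1-based position `start`."""
    if i_to_l:
        protein_seq = protein_seq.replace("I", "L")
        peptide = peptide.replace("I", "L")
    return protein_seq[start - 1:start - 1 + len(peptide)] == peptide

toy_protein = "MKGARRYLIKEK"  # invented sequence, for illustration only
print(verify_match(toy_protein, "GARRYLL", 3))
print(verify_match(toy_protein, "GARRYAW", 3))
```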

New Peptide Mapping Tool :: mapPeptides

    ----------------------------------------------------------------------------------
    Program: ./mapPeptides.pl
    Purpose: Maps peptide sequences to all proteins using indexed segments.
             Specify either a single <peptide> (use X for wildcard), a list of
             peptides (one per line) contained in <file>, or a <pepxml_file>.
    Usage:   ./mapPeptides.pl [options] <fasta_file> <peptide>|<file>|<pepxml_file>
    Requires: File <fasta_file>.pep.idx (unless using -i option)
    Options:
      -U        omit UNMAPPED sequences from report
      -u        convert input sequences to uppercase
      -o <fmt>  output format, one of: text, tsv, pepx [default:tsv]
      -f <num>  fuzzy sequence mapping, with <num> unknown aminoacids [max:3]
                note: for <num>=3, only consecutive AAs are considered
      -m <tol>  only consider "fuzzy" peptides of mass within +/- <tol> of original peptide
      -t <num>  number of threads to use for faster processing [default:1]
      -i        input file is index file, not source fasta
    For Developers:
      -z        print performance metrics
      -D        print debug information
    ----------------------------------------------------------------------------------

mapPeptides :: Features
- Reads pertinent info from index to build segments, AA-subs, protein aliases, and byte offsets
- Can use any index other than default (-i option)
- In batch mode, all peptide segments are computed first, before index lookup
  - Saves time, as some/many might be shared
  - Requires only a single pass through index file
- Can read from command-line, peptide list file, or pepXML
- Wildcard and fuzzy matching, including mass tolerance

mapPeptides :: Wildcards
- Use X to designate a wildcard (single peptide mode only)
- Peptide is expanded into all possible combinations, and then mapped

    LETTERX ->
    LETTERA LETTERC LETTERD LETTERE LETTERF LETTERG LETTERH LETTERK LETTERL LETTERM
    LETTERN LETTERP LETTERQ LETTERR LETTERS LETTERT LETTERV LETTERW LETTERY
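
Wildcard expansion is a Cartesian product over the residue alphabet at each X. The sketch below uses a 19-letter alphabet, the 20 standard amino acids with I folded into L, which matches the 19 expansions listed on the slide. The function name is hypothetical.

```python
from itertools import product

# 19 residues: the 20 standard amino acids with I folded into L.
ALPHABET = "ACDEFGHKLMNPQRSTVWY"

def expand_wildcards(peptide, alphabet=ALPHABET):
    """Replace every X with each residue in `alphabet`, in all combinations."""
    slots = [alphabet if aa == "X" else aa for aa in peptide]
    return ["".join(combo) for combo in product(*slots)]

expanded = expand_wildcards("LETTERX")
print(len(expanded), expanded)
```

Each additional X multiplies the number of candidates by 19, so a peptide with two wildcards already expands to 361 sequences.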

mapPeptides :: Fuzzy Matching
- Find wildcards in f unspecified positions
- As before, peptide is expanded into all possible combinations, and then mapped
- Can specify up to f=3, but in this case all are consecutive
  - Otherwise suffer combinatorial wrath!
- Helpful for finding mis-mapping due to transposed AAs

Example: ALIGNED

    f=1: XLIGNED AXIGNED ALXGNED ALIXNED ALIGXED ALIGNXD ALIGNEX
    f=2: XXIGNED XLXGNED XLIXNED XLIGXED XLIGNXD XLIGNEX AXXGNED AXIXNED AXIGXED AXIGNXD AXIGNEX etc.
    f=3: XXXGNED AXXXNED ALXXXED ALIXXXD ALIGXXX
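
The pattern generation above can be sketched with `itertools.combinations`, restricting f=3 to consecutive positions as the slide describes. The function names are hypothetical, and the 19-letter alphabet (I folded into L) is an assumption. Under that assumption the number of unique peptides for ATLASLIKE comes out at 163, 11,827 and 45,847 for f=1, 2 and 3, which coincides with the #peptides column reported in the mass-constraints table of this talk.

```python
from itertools import combinations, product

ALPHABET = "ACDEFGHKLMNPQRSTVWY"  # assumed: 20 standard residues, I folded into L

def fuzzy_patterns(peptide, f):
    """Patterns with exactly f positions replaced by X (consecutive if f == 3)."""
    if f not in (1, 2, 3):
        raise ValueError("f must be 1, 2 or 3")
    n = len(peptide)
    if f == 3:
        position_sets = [tuple(range(i, i + 3)) for i in range(n - 2)]
    else:
        position_sets = combinations(range(n), f)
    patterns = []
    for positions in position_sets:
        chars = list(peptide)
        for i in positions:
            chars[i] = "X"
        patterns.append("".join(chars))
    return patterns

def fuzzy_peptides(peptide, f, alphabet=ALPHABET):
    """Unique peptides obtained by expanding every fuzzy pattern."""
    peptide = peptide.replace("I", "L")
    result = set()
    for pattern in fuzzy_patterns(peptide, f):
        slots = [alphabet if c == "X" else c for c in pattern]
        result.update("".join(combo) for combo in product(*slots))
    return result

print(fuzzy_patterns("ALIGNED", 3))
for f in (1, 2, 3):
    print(f, len(fuzzy_peptides("ATLASLIKE", f)))
```

Because each X may also take the original residue, the f=2 expansion already contains every f=1 peptide and the unmodified peptide itself, so no separate union over smaller f is needed.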

mapPeptides :: Mass Constraints
- Fuzzy/wildcard matching can return MANY results
- Option to map only to isobaric sequences, within specified tolerance (-m option)
- In the works: also consider common PTMs

    Match type      #peptides  #segments  #mappings  Elapsed time  Elapsed time (-t 3)
    ATLASLIKE               1          2          5         00.22
    -f 1                  163        182        208         00.89
    -f 2               11,827      6,656     13,047         15.22                08.32
    -f 2 -m 0.1           133        161        180         00.94
    -f 2 -m 0.01           45         64         58         00.68
    -f 3               45,847     39,672     31,870         50.38                24.07
    -f 3 -m 0.1           264        311        290         01.98
    -f 3 -m 0.01           57         77        33*         01.16

* ATL == SVV, TAL
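
The isobaric filter can be illustrated with standard monoisotopic residue masses. The sketch below assumes the tolerance is expressed in daltons and that monoisotopic masses are used. Neither is stated on the slide, and the function names are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residue_mass(peptide):
    """Sum of residue masses (terminal groups cancel when comparing peptides)."""
    return sum(MONO[aa] for aa in peptide)

def isobaric(candidates, original, tol):
    """Keep candidates whose mass is within +/- tol of the original peptide."""
    target = residue_mass(original)
    return [c for c in candidates if abs(residue_mass(c) - target) <= tol]

candidates = ["SVVASLLKE", "TALASLLKE", "ATLASLLQE", "ATLASLLKD"]
print(isobaric(candidates, "ATLASLLKE", tol=0.1))
print(isobaric(candidates, "ATLASLLKE", tol=0.01))
```

ATL, SVV and TAL have the same elemental composition, which is why they pass even a tight tolerance, as the footnote notes. The K to Q substitution differs by about 0.036 Da, so it passes at 0.1 but not at 0.01.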

mapPeptides :: Mass Constraints (2)
- Fuzzy/wildcard matching can return MANY results
- Option to map only to isobaric sequences, within specified tolerance (-m option)
- In the works: also consider common PTMs

    Match type      #peptides  #segments  #mappings  Elapsed time  Elapsed time (-t 3)
    SRMATLASLIVE            1          3          9         00.22
    -f 1                  217        273        720         01.19
    -f 2               21,601      9,987     26,292         27.53                13.99
    -f 2 -m 0.1           281        312       504*         01.61
    -f 2 -m 0.01          106        151       261*         01.15
    -f 3               65,341     58,767    29,637*       1:06.89                32.15
    -f 3 -m 0.1           472        665       315*         03.24
    -f 3 -m 0.01           94        145       117*         01.83

* No new protein families

Work in Progress
- PepXML output
- NTT/NMC
- Mark position and number of variants in each segment
- AA insertions / deletions
- Integrate basic mods (+ user-specified?)
- Front-end on TPP
- PeptideAtlas?
- Publish...